Preface



DISC, the International Symposium on Distributed Computing, is an annual 
forum for research presentations on all facets of distributed computing. DISC 
2000 was held on 4 - 6 October, 2000 in Toledo, Spain. This volume includes 
23 contributed papers and the extended abstract of an invited lecture from last 
year’s DISC. It is expected that the regular papers will later be submitted in a 
more polished form to fully refereed scientific journals. The extended abstracts 
of this year’s invited lectures, by Jean-Claude Bermond and Sam Toueg, will 
appear in next year’s proceedings. 

We received over 100 regular submissions, a record for DISC. These sub- 
missions were read and evaluated by the program committee, with the help of 
external reviewers when needed. Overall, the quality of the submissions was 
excellent, and we were unable to accept many deserving papers. 

This year’s Best Student Paper award goes to “Polynomial and Adaptive
Long-Lived (2k - 1)-Renaming” by Hagit Attiya and Arie Fouren. Arie Fouren
is the student author.



Faith Fich¹ and Eric Ruppert²

¹ Department of Computer Science, University of Toronto
² Department of Computer Science, Brown University



1 Introduction 

What can be computed in a distributed system in which faults can occur? This
is a very broad question. There are many different models of distributed systems
and many different kinds of faults that can occur. Unlike the situation in
sequential models of computation, small changes in the model of a distributed
system can radically alter the class of problems that can be solved. Another
important goal in the theory of distributed computing is to understand how
efficiently a distributed system can compute those things which are computable.
There are a variety of resources to consider, including time, contention, and
the number and sizes of messages and shared objects.

This paper discusses results that say what cannot be computed in certain
environments or when insufficient resources are available. A comprehensive
survey would require an entire book. As in Nancy Lynch's excellent 1989 paper,
“A Hundred Impossibility Proofs for Distributed Computing” [86], we shall
restrict ourselves to some of the results we like best or think are most important.
Our aim is to give you the flavour of the results and some of the techniques
that have been used. We shall also mention some interesting open problems and
provide an extensive list of references. The focus will be on results from the past
decade.

We begin in Sections 2 and 3 with a brief description of aspects of the models,
terminology, and problems that are discussed throughout the paper. The rest of
the paper presents a wide variety of lower bound results. Section 4 describes the
valency argument, a fundamental technique that has been adapted to prove lower
bounds for many different models. One systematic approach to understanding
the computational power of different models is to obtain efficient simulations of
some models by others. This allows lower bounds derived in one model to be
extended to other models. Some simulation techniques will be described in
Section 5. Another systematic approach to studying computability for distributed
systems is to characterize the models that can solve a particular problem. Some
results along these lines will be discussed in Section 6. Alternatively, one can
characterize the set of problems solvable in a given model. Results of this type
are also described in Section 6, and more fully in Section 7, which covers an
important development of the past decade: the use of ideas from topology to
prove lower bounds in distributed computing. Section 8 examines the question
of whether weak shared object types can become more powerful when they are
used in combination with other weak types. A number of techniques that have
been used to prove lower bounds on the complexity of solving problems are
discussed in Section 9. The last section contains some general remarks about the
value of lower bounds.

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 1-28, 2000.
(c) Springer-Verlag Berlin Heidelberg 2000



2 Models 



There are a number of excellent descriptions of distributed models of
computation, including motivation and formal definitions [16,80,87]. Therefore,
we shall only briefly mention some aspects of these models which are necessary
for the results we present.

There are different ways processes can communicate with one another. In
message-passing models, processes send messages to one another through
communication channels. In shared-memory models, processes communicate by
performing operations on shared data structures, called objects. The typewriter
font is used to denote object types. Each type describes the set of operations
that can be performed on an object, and the responses that it should return if
it is accessed by one operation at a time. The most common type of object is
the register, which stores a value that can be read or written by all processes.
Throughout this paper, we assume objects are linearizable [64], so operations
that may actually run concurrently all appear to happen instantaneously in
some order.

Processes may run synchronously, so all processes take steps at exactly the
same speed, or asynchronously, where processes may run at arbitrarily varying
speeds. The latter case is generally modelled by thinking of an adversary
scheduler that chooses the order in which processes take steps. Algorithms must
work correctly regardless of the schedule the adversary chooses.

Many different kinds of faults are considered in distributed systems. Processes
may fail and perhaps recover, their states can become corrupted, or they can
behave maliciously. Unless otherwise noted, we assume that faulty processes
fail by halting permanently. An algorithm that can tolerate up to t failures is
called t-resilient. A wait-free algorithm ensures that every non-faulty process
will correctly complete its task even if any number of other processes fail. Thus,
it has no infinite executions. Communication channels can fail, lose or delay
messages, or deliver them out of order. Shared objects can also be corrupted or
fail to respond.

An object type is deterministic if its response to each operation is uniquely
determined by its previous history. Non-deterministic objects may have multiple
possibilities. Algorithms must work correctly for all possible responses. Similarly,
in randomized algorithms, a process may have many choices for its next step,
but the choice is made according to some probability distribution. Generally, for
randomized algorithms, termination is required only with high probability and
one considers expected time, rather than worst-case time.



3 Consensus and Related Problems

The consensus problem, first studied by Pease, Shostak and Lamport [81,93], is
perhaps the most thoroughly investigated problem in distributed computing. It
is simply stated and it is a primitive building block for many distributed systems.

Consensus is an example of a decision task, in which each process gets a
private input value from some set and must eventually terminate after having
produced an output value. The task specification describes which output values
are legal for given input values. For consensus, there are usually two correctness
properties that must be satisfied:

Agreement: the output values of all processes are identical, and

Validity: the output value of each process is the input value of some
process.

In the binary consensus problem, all input values come from the set {0, 1}. Many
variants of the consensus problem have been studied for a variety of different
models of distributed computing [41,86].
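The two correctness properties can be checked mechanically for a single finished execution. The following Python sketch is illustrative only: the function name check_consensus and the list-based representation of inputs and outputs are assumptions made here, not notation from the literature.

```python
def check_consensus(inputs, outputs):
    """Return (agreement, validity) for one finished execution.

    inputs  -- list of input values, one per process
    outputs -- list of output values, one per process that produced an output
    """
    # Agreement: all output values are identical.
    agreement = len(set(outputs)) <= 1
    # Validity: every output value is the input value of some process.
    validity = all(v in inputs for v in outputs)
    return agreement, validity


if __name__ == "__main__":
    print(check_consensus([0, 1, 1], [1, 1, 1]))
    print(check_consensus([0, 0, 0], [1, 1, 1]))
```

A checker of this kind only examines one execution at a time; a correct algorithm must satisfy both properties in every execution the scheduler can produce.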

There are easy reductions from consensus to other problems, such as leader
election. In this problem, there are no inputs: exactly one process (called the
leader) must output 1 and all other processes must output 0. Once processes
have elected a leader, they can solve consensus by using the leader’s input value
as their common output value. Thus, lower bounds for consensus give lower
bounds for leader election and other problems.

Consensus is a good candidate problem for a systematic study of
computability, since Herlihy [53] showed that it is universal: a system equipped
with registers and objects that can solve wait-free consensus can implement any
other object type in a wait-free manner.

Object types can be classified according to the ability of a shared-memory
distributed system to solve consensus using objects of that type. Specifically, the
consensus number cons(T) of a set of object types T is the maximum number of
processes for which consensus can be solved using objects in T and registers
[53,66]. Then cons({T}) < cons({T'}) implies that T' cannot be implemented
in a wait-free manner from objects of type T and registers. It follows from
this observation and Herlihy's universality result that this classification, called
the consensus hierarchy, gives a great deal of information about the power of
different models of asynchronous, shared-memory systems.

However, the consensus number of an object type does not say everything
about the power of a shared-memory model which provides objects of that type
and registers. For example, there are object types T and T' with consensus
numbers 1 and n, respectively, such that 2-set consensus for 2n + 1 processes can
be solved using objects of type T and registers, but not using only objects
of type T' and registers [96]. The k-set consensus problem, introduced by
Chaudhuri [34], is similar to the consensus problem, but relaxes the agreement
property. Instead of requiring that all output values be identical, it requires that
the set of output values produced has cardinality at most k. Thus, consensus is
a special case of k-set consensus, with k = 1.



4 Valency Arguments

The valency argument has become the most widely-used technique for
impossibility proofs in distributed computing. It was introduced by Fischer, Lynch
and Paterson [46] to prove that 1-resilient (and, hence, wait-free) consensus is
impossible in an asynchronous message-passing system. Loui and Abu-Amara [84]
and Herlihy [53] adapted the valency argument to show impossibility results for
several asynchronous shared-memory models.

We shall give an outline of these proofs and then mention some other ex- 
amples of valency arguments. Valency arguments also play a supporting role in 
many of the results surveyed in other sections of this paper. 

A configuration is a “snapshot” of a distributed system during the execution
of an algorithm: it consists of the state of every process and the environment
(messages in transit for a message-passing system, or states of all shared objects
for a shared-memory system). A configuration of a consensus algorithm is called
univalent if every possible execution continuing from that configuration gives the
same output value, and multivalent otherwise. In other words, from a multivalent
configuration, there are two or more executions that produce different outputs.

There are three parts to the valency argument that 1-resilient consensus is
impossible in an asynchronous system. The first is the observation that any
configuration where some process has produced an output is univalent, by
definition. Secondly, the validity condition of the consensus problem can be used
to show that any consensus algorithm has a multivalent initial configuration. It
follows that any consensus algorithm must have a critical configuration: a
multivalent configuration where a single step by any process will move the system
into a univalent configuration. (Otherwise, one could construct an infinite
execution containing only multivalent configurations where no process ever produces
an output.) The third part of the argument shows that a critical configuration
cannot exist. Assuming that such a configuration does exist, one can derive a
contradiction using a case argument that considers the possible pairs of steps s0
and s1 that could be taken from the critical configuration to lead to univalent
configurations with different decision values. For example, if the two steps s0 and
s1 involve message channels with different destinations or access different shared
objects, then performing s0 followed by s1 leads to the same configuration as
doing the two steps in the reverse order. This contradicts the fact that the two
steps lead to different decision values. The other cases show that configurations
obtained by taking one or both of these steps are either identical or differ only in
the state of one process, which can then be permanently halted by an adversarial
scheduler.
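The definitions of univalent, multivalent and critical configurations can be made concrete by exhaustive search over a very small protocol. The Python sketch below is an illustration under stated assumptions, not a construction from the papers cited: it models a hypothetical two-process protocol in which each process takes one atomic step on a write-once cell (stronger than a register), and the names successors, valence and is_critical are invented here. Because the cell is stronger than a register, a critical configuration does exist in this toy model; the impossibility proofs show that none can exist when only registers or message channels are available.

```python
def successors(config):
    """Configurations reachable by one step of either process.

    config = (inputs, cell, outputs). In this hypothetical protocol a
    process's single atomic step stores its input in the write-once cell
    if the cell is still empty, then outputs the contents of the cell.
    """
    inputs, cell, outputs = config
    result = []
    for p in (0, 1):
        if outputs[p] is None:  # process p has not yet taken its step
            value = inputs[p] if cell is None else cell
            new_outputs = tuple(value if q == p else outputs[q] for q in (0, 1))
            result.append((inputs, value, new_outputs))
    return result


def valence(config):
    """Set of output values produced in some execution continuing from config."""
    seen, stack, values = set(), [config], set()
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        values.update(v for v in c[2] if v is not None)
        stack.extend(successors(c))
    return values


def is_critical(config):
    """Multivalent, yet every single step leads to a univalent configuration."""
    return len(valence(config)) > 1 and all(
        len(valence(s)) == 1 for s in successors(config))


if __name__ == "__main__":
    initial = ((0, 1), None, (None, None))
    print(valence(initial), is_critical(initial))
```

With inputs 0 and 1, the initial configuration of this toy protocol is multivalent and critical: whichever process steps first fixes the decision value.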

Attiya, Dwork, Lynch, and Stockmeyer [12] used valency arguments to give
lower bounds on the time required to solve consensus in a semi-synchronous
message-passing model, where messages are delivered within time d and there
is a bound, r, on the ratio of process speeds. They proved that the worst-case
running time of a t-resilient consensus protocol is at least (r + t - 1)d. Alur,
Attiya and Taubenfeld [5] considered semi-synchronous models where processes
communicate using shared registers. They proved that each process requires
[...] where U is an (unknown) upper bound on the time between
steps of a process. To do this, they carefully assign times (consistent with the
parameter U) to the steps of any sufficiently long asynchronous execution.

Taubenfeld and Moran [106] used a valency argument to provide a general 
impossibility result for a large class of problems in the asynchronous shared 
register models in the case where failures can occur and in the more benign 
case where faulty processes do not take any steps. (An earlier paper [105] proved 
similar results for message-passing systems.) 

Recently, Moses and Rajsbaum [91] gave a unified framework for proving 
lower bound results, based on the valency argument, that applies to both message- 
passing and shared-memory systems, and for synchronous and asynchronous 
schedulers. Their approach is to restrict the adversary scheduler to a nicely 
structured subset of the possible executions. (Some earlier work by Lubitch 
and Moran [85] used a similar approach.) For example, Moses and Rajsbaum 
considered computation using single-writer registers, restricted types of 
registers to which only a single, fixed process can write. They showed that 
consensus is impossible if processes communicate using these objects, even when 
they are guaranteed to be scheduled in slightly asynchronous rounds where, in 
each round, at least n - 1 processes write to a register and then read the values
written in that round by at least n - 1 processes.

This is an instance of an important observation: even though impossibility 
results and lower bounds with restricted adversaries are stronger (i.e. they imply 
the same lower bound against a more general adversary), they may be easier 
to understand and have more elegant proofs (because there are fewer cases to 
consider). Such proofs also help us identify which aspects of the problem or model
make the problem unsolvable. The key to such proofs is coming up with the right 
adversary. One must discard any unnecessary complications while ensuring that 
the adversary is still strong enough to prove the impossibility result. 

Valency arguments have been generalized in several ways. The definitions of
univalence and multivalence have been adapted to fit other models and problems,
as will be seen in Sections 6, 8 and 9.4. When valency arguments are used for
other types of faults, it is necessary to adapt the way in which the scheduler con- 
ceals evidence about the first step taken after a critical configuration. It suffices 
to construct, from any multivalent configuration, two successor configurations 
which are indistinguishable to some process, yet can lead to executions produc- 
ing different outputs. This approach is used by Jayanti, Chandra and Toueg [70] 
to prove the impossibility of certain implementations in a model where objects, 
instead of processes, fail. Related work by Afek, Greenberg, Merritt and Tauben- 
feld [2] shows that consensus is impossible when processes communicate with any 
types of objects, if half the processes can fail and the states of half the objects 
can become corrupted. 

Lo [82] used a valency argument, adapted to deal with non-deterministic
objects, to prove that there is an object type with consensus number 1 which can
be used together with registers to solve 1-resilient consensus for n processes.
In contrast, for t > 1, Chandra, Hadzilacos, Jayanti, and Toueg [32] proved that
an object type has consensus number greater than t if and only if it can be used
together with registers to solve t-resilient consensus for n > t processes.

Dwork, Herlihy and Waarts [42] use a valency argument to obtain a lower
bound on the contention of any wait-free consensus algorithm that uses shared 
memory. 

Valency arguments have also been used to study quantum-based schedulers,
where the scheduler guarantees that processes will run for a certain length of 
time (called a quantum) without being pre-empted by other processes running 
on the same processor. They give lower bounds on the size of quantum necessary 
for n-process consensus to be universal in a system of n processors executing any 
number of processes [6]. 

5 Simulations 

Simulations provide a way of extending lower bounds to different settings. For
example, suppose that one system A can simulate another system A'. Then,
if a problem has been proved impossible in A, it follows that the problem is
impossible in A', too. Similarly, lower bounds in A imply lower bounds in A',
although the bounds obtained for A' may be smaller, depending on the efficiency
of the simulation. In this section, we survey some of the simulations that have
been useful for establishing such results.

The Borowsky-Gafni (BG) simulation [2V24] describes how a system of n 
processes can simulate an algorithm designed for a system with m processes.
They consider asynchronous systems where processes communicate using
registers and up to t processes may fail. This important technique has been used
and extended by others [36,62,96].

In the following brief description, we call the simulated processes threads
and the simulating processes processors, to improve clarity. A key element of
the simulation is the safe agreement subroutine. It satisfies the agreement and
validity properties of the consensus problem, but might not terminate. However,
if processors are running several copies of the safe agreement routine in
parallel, a processor failure can prevent at most one of the copies from terminating.
Although the BG simulation technique is applicable more generally, we consider
how to simulate an algorithm with m threads for the k-set consensus problem,
defined in Section 3. Every processor simulates the steps of every thread. This is
done in parallel for all processors and threads. The processors use safe agreement
to ensure that a simulated step has the same result in the simulations carried out
by different processors. Each processor visits the threads in round-robin order
and tries to execute the next step of each thread it visits. Whenever a processor
observes that a thread has terminated, the processor can also terminate, using
the output of the thread as its own output. Since the threads will output at most
k different values, it follows that this simulation provides a correct solution to
the k-set consensus problem. The safe agreement routine is designed so that a
processor failure will block the simulation of at most one thread. This ensures
that, if the original m-process algorithm was t-resilient, then the n-process
algorithm constructed will also be t-resilient. As described in Section 7.2, there is
no wait-free (i.e. k-resilient) k-set consensus algorithm for k + 1 processes. Thus,
the BG simulation implies that there is no k-resilient k-set agreement algorithm
for m > k processes.

Jayanti, Chandra, and Toueg [70] used a different simulation to prove that
there is no algorithm to solve consensus for two processes in an asynchronous
shared-memory system in which at most one object can fail by delaying its
responses forever. To do this, they showed how m + 2 processors that communicate
using non-faulty registers could simulate a consensus algorithm for two threads
that uses m objects and tolerates one object failure. The actions of each thread
and object are simulated by a different processor. The resulting algorithm would
solve 1-resilient consensus using only registers, which is impossible [84].

Afek and Stupp [4] obtained complexity results from computability results
by using a simulation. They considered the problem of electing a leader in a
system of n processors using registers and one compare&swap object that can
store one of v different values, and proved that some process must take Ω(log_v n)
steps. Given such a leader election algorithm in which no process ever takes more
than d steps, they showed how ⌊n/(d + 1)⌋ processes can simulate it, using only
registers, and thereby solve (v - 1)^d-set consensus. In the simulation, different
processes may actually simulate different executions of the leader election
algorithm. However, the number of different simulated executions is at most (v - 1)^d.
The lower bound on d follows from the fact that registers alone cannot solve
(v - 1)^d-set consensus when (v - 1)^d < ⌊n/(d + 1)⌋ (see Section 7.2).

6 Deciding When Problems are Solvable 

Because there are so many different models of distributed systems, it would be 
useful to have a general technique to determine whether a given model can solve 
a given problem, or implement a given data structure. Unfortunately, this is not 
possible in general. 



6.1 Undecidability

Jayanti and Toueg [73] proved that there is no algorithm that, given the
description of a type and an initial state, determines if it can be implemented
from registers. They use a reduction from the halting problem. Given a
(deterministic) Turing machine M, they construct a type T(M) whose state stores
a configuration of M and a boolean flag. Suppose the object is initially in a
state corresponding to M's initial configuration on a blank input tape and the
boolean flag is set to false. The type T(M) is equipped with a single operation.
The operation updates the configuration stored in the state by simulating one
step of M and returns 0 as long as M has not halted. The first operation applied
to T(M) after the simulated machine M has halted sets the flag to true and
returns 1. Any operation on T(M) after the flag is set returns 2.

If M halts on a blank tape, then T(M) can be used to solve leader election
(and hence consensus) for two processes: each process repeatedly accesses the
object until it returns a non-zero value and the process that receives the value 1
becomes the leader. This means registers cannot implement T(M). However,
if M never halts on a blank input tape, then T(M) can be implemented using a
register initialized to 0 and having each operation applied to T(M) replaced by
a read of this register. It follows that one cannot decide whether the type T(M)
can be implemented from registers. A similar construction can be used to show
that the consensus number of a given type is undecidable. Further undecidability
results are described in Section 7.3.
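The behaviour of an object of type T(M) can be sketched in a few lines of Python. This is only an illustration of the description above: the class name TMObject, the dictionary encoding of the transition function, the state names "q0" and "halt", and the helper elect are all assumptions introduced here, and calls to op stand in for operations that the object is assumed to execute atomically.

```python
class TMObject:
    """Sketch of an object of type T(M): one operation, responses 0, 1 or 2."""

    def __init__(self, transitions, start="q0", halt="halt", blank="_"):
        # transitions maps (state, symbol) -> (new_state, written_symbol, move)
        self.delta = transitions
        self.state, self.halt, self.blank = start, halt, blank
        self.tape, self.head = {}, 0   # blank input tape
        self.flag = False

    def op(self):
        if self.flag:
            return 2                   # the flag has already been set
        if self.state == self.halt:
            self.flag = True           # first operation after M has halted
            return 1
        # Otherwise simulate one step of M and return 0.
        symbol = self.tape.get(self.head, self.blank)
        self.state, written, move = self.delta[(self.state, symbol)]
        self.tape[self.head] = written
        self.head += 1 if move == "R" else -1
        return 0


def elect(obj):
    """Access the object until it returns a non-zero value; 1 means leader."""
    while True:
        response = obj.op()
        if response != 0:
            return response == 1


if __name__ == "__main__":
    # A hypothetical machine that halts after three steps on a blank tape.
    halts = {("q0", "_"): ("q1", "1", "R"),
             ("q1", "_"): ("q2", "1", "R"),
             ("q2", "_"): ("halt", "1", "R")}
    obj = TMObject(halts)
    print([obj.op() for _ in range(6)])
```

If the machine never halts, every call to op returns 0, which is why a register that always holds 0 suffices in that case; elect would then loop forever, matching the fact that the object gives no help in electing a leader.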

6.2 Decidability of Consensus Numbers

For some natural classes of types, decision procedures for consensus number do
exist. They follow from theorems that characterize types in the class in terms of
their consensus number.

One such class consists of the read-modify-write (RMW) object types [76].
An RMW operation updates the state of the object by applying some function,
and returns the old value of the state. For example, the test&set operation is an
RMW operation that applies the function f(x) = 1 and fetch&add applies the
function f(x) = x + 1. Other RMW operations include read and compare&swap.
An RMW type is one where all permitted operations have this form. Ruppert [98]
gave a characterization of the RMW types that can solve consensus among n
processes. The characterization uses a restricted form of the consensus problem,
called team consensus, where processes are divided into two teams and all
processes on the same team receive the same input. An RMW type T has consensus
number at least n if and only if there is an algorithm for solving team
consensus among n processes in which every process performs exactly one step on an
object of type T. A valency argument was used to show the necessity of this
condition: by examining the behaviour of processes as they each take their first
step after the critical configuration of a consensus algorithm, one can obtain the
required one-step algorithm for team consensus. For finite types, this condition
is decidable.
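The common form of RMW operations, apply a function to the state and return the old state, can be written down directly. The Python sketch below is illustrative: the class RMWObject and the helper names are assumptions made here, and each call to rmw is assumed to execute atomically, which a sequential Python program provides trivially.

```python
class RMWObject:
    """A shared object whose every operation is read-modify-write."""

    def __init__(self, initial):
        self.state = initial

    def rmw(self, f):
        """Apply f to the state and return the old state (assumed atomic)."""
        old = self.state
        self.state = f(old)
        return old


def op_test_and_set(obj):
    return obj.rmw(lambda x: 1)          # f(x) = 1


def op_fetch_and_add(obj):
    return obj.rmw(lambda x: x + 1)      # f(x) = x + 1


def op_read(obj):
    return obj.rmw(lambda x: x)          # f(x) = x leaves the state unchanged


def op_compare_and_swap(obj, expected, new):
    return obj.rmw(lambda x: new if x == expected else x)


if __name__ == "__main__":
    bit = RMWObject(0)
    print(op_test_and_set(bit), op_test_and_set(bit))
```

With a test&set object initialized to 0, exactly one caller sees the response 0, so a single step tells each process whether it was first.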

A similar characterization was also given for readable types [98], which allow
processes to read the state of the object without changing it. Together, these two
classes of objects contain many of the common shared-memory primitives. These
characterizations were used to prove impossibility results for consensus in the
multi-object model [99] (where processes can access more than one shared object
in a single atomic action), following the work of Afek, Merritt and Taubenfeld [3].

Recently, Herlihy and Ruppert [62] gave a characterization of one-shot types
that can solve wait-free consensus among n processes. (A one-shot type is one
that can only be accessed once by each process.) This characterization, too, is
decidable for finite types.

A natural open question is to obtain an algorithm that decides the consensus
number of any type with finite state set. An interesting special case would be to
consider non-deterministic RMW and readable types. One might also be able to
gain a better understanding of the relative power of different types by studying
their ability to solve other problems such as set consensus and
consensus-with-reset.



6.3 Characterizing Solvable Tasks

A related question is to determine whether a given problem (from some class) 
is solvable in a particular fixed model. Here, the approach has been to find
characterizations of the solvable problems. 

In the asynchronous message-passing model, Biran, Moran and Zaks [18],
building on earlier work by Moran and Wolfstahl [90], gave a combinatorial
characterization of the decision tasks that can be solved 1-resiliently. This
characterization, described below, is in terms of the task's input and output
vectors, with one coordinate corresponding to each process. Suppose there is a
1-resilient message-passing algorithm to solve a given task for n processes. Let
G(x) denote the set consisting of all output vectors produced by the algorithm
with input vector x. First, for each input vector x, they consider the similarity
graph with vertex set G(x) and edges between any two vectors that differ in
exactly one coordinate. They prove this similarity graph is connected using a
valency argument with slightly different definitions: a configuration C is
univalent if all executions from C lead to an output vector in the same connected
component and multivalent otherwise. Secondly, they show that, if I is any set
of input vectors that differ only in coordinate j, then there is a set of output
vectors, one from G(x) for each x ∈ I, that differ only in coordinate j. This
follows from consideration of those executions in which process Pj is non-faulty,
but takes no steps until all other processes have produced an
output.
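The first condition, connectivity of the similarity graph, is easy to test for a small explicit set of output vectors. The following Python sketch is an illustration only; the function names similarity_graph and is_connected, and the representation of output vectors as tuples, are assumptions made here.

```python
from itertools import combinations


def similarity_graph(vectors):
    """Adjacency sets: edges join vectors that differ in exactly one coordinate."""
    vectors = [tuple(v) for v in vectors]
    adj = {v: set() for v in vectors}
    for u, v in combinations(vectors, 2):
        if sum(a != b for a, b in zip(u, v)) == 1:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def is_connected(adj):
    """Depth-first search from an arbitrary vertex reaches every vertex."""
    if not adj:
        return True
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(adj)


if __name__ == "__main__":
    consensus_outputs = [(0, 0, 0), (1, 1, 1)]
    print(is_connected(similarity_graph(consensus_outputs)))
```

For three-process binary consensus, any set of output vectors containing both (0, 0, 0) and (1, 1, 1) is disconnected, since the agreement property rules out every intermediate vector. A task that also allowed (0, 0, 1) and (0, 1, 1) would pass this particular check.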

Conversely, suppose there is a task for n processes such that there is a set
G(x) of allowable output vectors for each input vector x, which has the following
two properties: the similarity graph with vertex set G(x) is connected, and if I
is a set of input vectors that differ only in coordinate j, then there is a set of
output vectors, one from G(x) for each x ∈ I, that differ only in coordinate j.
Then Biran, Moran and Zaks proved that there is a 1-resilient message-passing
algorithm to solve the task. In later papers, they also showed that determining
whether a task has these properties is NP-hard for more than two processes [19],
and gave very precise bounds on the round complexity of solving any task that
satisfies them [20].

Attiya, Gorbach and Moran [13] gave a simple characterization of the tasks 
that are solvable in systems where asynchronous processes have no names, run 
identical programmes, do not know how many processes are in the system, and 
communicate using registers. The characterization (and the proof of its
necessity) is similar in flavour to the results by Biran, Moran and Zaks, described
above. Other impossibility results for systems with anonymous processes appear 
in [39,72]. 

Chor and Nelson [38] studied interactive tasks, where each process receives a
sequence of input values and must produce the output value corresponding to its
current input value before being given its next input value. They characterized
the interactive tasks that can be solved in an asynchronous system if a consensus
subroutine is available. Their conditions ensure, among other things, that the
set of allowable output vectors does not depend on the value of input values
which have not yet been received. Note that the specifications of interactive tasks
do not necessarily ensure linearizability, so Herlihy's universality result [53] does
not apply.



7 Topology

Perhaps the most interesting development in the theory of distributed
computing during the past decade has been the use of topological ideas to prove results
about computability in fault-tolerant distributed systems. Other connections
between topology and distributed computing have been discussed in the
literature (see [50,51,74,97]), but the results described in this section represent new
and powerful uses of topology in distributed computing, particularly for proving
lower bounds.

7.1 Simplicial Complexes

We begin with some brief definitions of ideas from the topology of simplicial 
complexes. Several papers contain good introductions to the connections between 
distributed computing and simplicial complexes [47,57,60].

A d-dimensional simplex (or d-simplex) is a set of d + 1 independent vertices.
Geometrically, the vertices can be thought of as (affinely) independent points
in Euclidean space. A 0-simplex is a single point, a 1-simplex is represented
by a line segment, a 2-simplex is represented by a filled-in triangle, and so on.
A (simplicial) complex is a finite set of simplexes closed under inclusion and
intersection. The dimension of a complex is the maximum dimension of any
simplex that appears in it. Examples of simplicial complexes appear in Figure 1.

A vertex can be used to represent the internal state (or part of the internal 
state) of a single process. A d-simplex whose vertices correspond to different 
processes represents compatible states of d + 1 processes. As an example, consider
the binary consensus problem for three processes, P, Q and R. The possible
starting configurations of an algorithm for this problem are shown in Figure 1(a). 
Each vertex is labelled by a process and the binary input value for that process. 
The complex consists of eight 2-simplexes arranged to form a hollow octahedron. 
Each 2-simplex represents one of the eight possible sets of inputs to the three 
processes. The corresponding output complex in Figure 1(b) shows the possible 
outputs for the binary consensus problem. In the upper 2-simplex, all processes
output value 0, while in the lower 2-simplex, all processes output value 1. Not
all output simplexes are legal for every input simplex: by the validity condition
of consensus, if all processes start with the input value 0, then only the upper
output simplex is legal. 

Fig. 1. (a) Input complex and (b) output complex for three-process binary consensus
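
The two complexes of Figure 1 are small enough to build explicitly. The following Python sketch is an illustration only: the helper names (closure, legal_outputs, f_vector) are hypothetical, and the map from input simplexes to legal output simplexes assumes the usual validity condition, namely that a value may be output only if it is the input of some process.

```python
from itertools import combinations, product

PROCESSES = ("P", "Q", "R")

def closure(facets):
    """Return every non-empty face of the given facets (closure under inclusion)."""
    faces = set()
    for facet in facets:
        for size in range(1, len(facet) + 1):
            faces.update(frozenset(c) for c in combinations(facet, size))
    return faces

# Input complex: one 2-simplex per input vector; a vertex is (process, input bit).
input_facets = [frozenset(zip(PROCESSES, bits)) for bits in product((0, 1), repeat=3)]
# Output complex: one 2-simplex where all output 0, one where all output 1.
output_facets = [frozenset((p, v) for p in PROCESSES) for v in (0, 1)]

def legal_outputs(input_facet):
    """Assumed validity rule: value v is allowed only if some process has input v."""
    inputs = {bit for _, bit in input_facet}
    return [f for f in output_facets if {v for _, v in f} <= inputs]

def f_vector(faces):
    """Number of simplexes of each dimension, from dimension 0 upwards."""
    counts = {}
    for face in faces:
        d = len(face) - 1
        counts[d] = counts.get(d, 0) + 1
    return [counts[d] for d in sorted(counts)]

input_complex = closure(input_facets)
output_complex = closure(output_facets)
print(f_vector(input_complex))   # [6, 12, 8]: the hollow octahedron
print(f_vector(output_complex))  # [6, 6, 2]: two disjoint triangles
```

With this encoding, legal_outputs returns only the all-0 triangle for the input vector (0, 0, 0), and both triangles for any mixed input vector.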



More generally, any decision task for n processes can be modelled in a similar
way. The input complex I contains one (n - 1)-simplex for each possible input
vector. The output complex O contains one simplex for each possible output
vector. A map Δ that takes each simplex S of I to a set of simplexes in O
(labelled by the same processes) defines which output vectors are legal for each 
input vector. 

Simplicial complexes are used as a means of describing whether processes 
can distinguish different configurations from one another. In that sense, they are
similar to, though more general than, the similarity graphs of Biran, Moran and
Zaks [18] discussed in Section 6.3. Nodes in those graphs correspond to
(n - 1)-simplexes. The situation where two output vectors differ in only one coordinate,
which is modelled by an edge in a similarity graph, is represented in the complex
by having the two simplexes share n - 1 common vertices. Complexes can capture
more information about the degree to which two configurations are similar: two
simplexes that have d common vertices are similar to exactly d processes. This
fact is useful in studying t-resilient algorithms in general, whereas similarity
graphs are useful only for the case t = 1.

Consider a wait-free protocol for n processes that solves some task. One 
can define a corresponding (n - 1)-dimensional protocol complex. Each vertex is
labelled by a process and the state of that process when it terminates. Given any
input vector and any schedule for the processes (as well as a description of the
results of any coin tosses or non-deterministic choices), the final state of every
process is determined. This final configuration is represented by a simplex in the 
protocol complex. 

Each process must decide on an output value for its task based solely on 
its internal state information at the end of the protocol. This defines a decision 
map δ that takes each vertex of the protocol complex to a vertex of the output
complex (labelled by the same process). Let S be a simplex of the protocol
complex. Since S represents a configuration of compatible final states for some
set of processes, δ(S) must be a simplex of the output complex, representing
a compatible set of outputs for those processes. Furthermore, δ must "respect"
the task specification: if S represents a configuration reached by some execution
whose inputs come from the simplex I of the input complex, then δ(S) must be
in Δ(I).

The basic method of proving lower bounds using the topological approach 
can now be summarized. One uses information about the model to prove that 
any protocol complex has some topological property which is preserved by the 
map δ. The specification of the task is used to show that the image of δ cannot
have the property, implying that no such map δ can exist.

For example, it can be shown that, in the asynchronous model where processes
use registers to communicate, any protocol complex (that begins from
a connected input complex) is connected [63]. The connectivity property is
preserved by any map δ, since δ maps simplexes to simplexes. As shown in Figure
1(a), the input complex for three-process binary consensus is connected. The image
of δ must include vertices in both triangles of the output complex, since the task
specification requires that, for any run where all processes get the same input
value v, all processes output v. Thus the image of δ is disconnected, and hence
three-process binary consensus is impossible in this model. 
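
The two connectivity facts used in this argument can be checked mechanically. The sketch below is only an illustration (count_components is a hypothetical helper, and the complexes are encoded by their maximal simplexes); it counts connected components with a small union-find structure.

```python
from itertools import product

PROCESSES = ("P", "Q", "R")

def count_components(facets):
    """Number of connected components of the complex generated by `facets`.
    Two vertices are linked whenever they belong to a common simplex."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for facet in facets:
        vertices = sorted(facet)
        root = find(vertices[0])
        for v in vertices[1:]:
            parent[find(v)] = root  # merge every vertex of the simplex
    return len({find(v) for v in list(parent)})

# Input complex: eight triangles, one per vector of binary inputs.
input_facets = [frozenset(zip(PROCESSES, bits)) for bits in product((0, 1), repeat=3)]
# Output complex: the all-0 triangle and the all-1 triangle.
output_facets = [frozenset((p, v) for p in PROCESSES) for v in (0, 1)]

print(count_components(input_facets))   # 1: the octahedron is connected
print(count_components(output_facets))  # 2: the two triangles are disjoint
```

Validity forces the image of a decision map to contain vertices such as (P, 0) and (P, 1), which lie in different components of the output complex, whereas a map that sends simplexes to simplexes cannot split a connected complex.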

7.2 Set Consensus Results

Much of the inspiration for the early topological impossibility results came from 
Chaudhuri [34], who defined the k-set consensus problem. She observed that
Sperner's Lemma [104], a tool often used in topology, could be applied to study
the task. In papers that first appeared at STOC in 1993, three different groups
of researchers [21,63,100] used Sperner's Lemma to prove that k + 1 processes
cannot solve wait-free k-set consensus in an asynchronous model using shared
registers. In addition to proving the result about set consensus, each of the 
three developed interesting techniques that led to proofs of different or more 
general results. This is a great example of the important role that lower bounds 
for a well-chosen problem can play in opening up new areas of research. Similar 
tools have also been used to provide lower bounds for the set agreement problem 
in a synchronous message-passing model [35]. Attiya reproved the impossibility
of set consensus using more elementary tools [11]. 
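
For readers unfamiliar with Sperner's Lemma, its two-dimensional form says that if a triangle is subdivided into small triangles and each vertex receives one of three labels, subject to the condition that a vertex may only use the labels of the corners of the smallest face of the big triangle containing it, then the number of small triangles carrying all three labels is odd (in particular, at least one exists). The sketch below checks this on a regular triangular grid with random labellings. The grid encoding and function names are assumptions made for illustration, not taken from the cited papers. Informally, if labels are read as decided values, a three-coloured cell corresponds to a configuration in which three processes decide three different values.

```python
import random

def random_sperner_labelling(n, rng):
    """Label grid points (i, j), i + j <= n, with barycentric coordinates
    (i, j, n - i - j). A point may use label c only if coordinate c is positive."""
    labels = {}
    for i in range(n + 1):
        for j in range(n + 1 - i):
            bary = (i, j, n - i - j)
            allowed = [c for c in range(3) if bary[c] > 0]
            labels[(i, j)] = rng.choice(allowed)
    return labels

def count_trichromatic(n, labels):
    """Count the small triangles whose three vertices carry labels 0, 1 and 2."""
    count = 0
    for i in range(n):
        for j in range(n - i):
            up = [(i, j), (i + 1, j), (i, j + 1)]
            if {labels[v] for v in up} == {0, 1, 2}:
                count += 1
            if i + j <= n - 2:  # the downward triangle exists only inside the grid
                down = [(i + 1, j), (i, j + 1), (i + 1, j + 1)]
                if {labels[v] for v in down} == {0, 1, 2}:
                    count += 1
    return count

rng = random.Random(0)
for n in (1, 2, 5, 10):
    cells = count_trichromatic(n, random_sperner_labelling(n, rng))
    print(n, cells % 2)  # the parity is always 1
```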

Borowsky and Gafni’s impossibility proof [21] uses the protocol complex for 
k + 1 processes, in the case where each process uses its process name as its input
to the set consensus problem. They introduce the immediate snapshot model,
which puts restrictions on the adversarial scheduler, and show that this model
can be simulated in the standard asynchronous model. Additionally, protocols
are assumed, without loss of generality, to be full-information protocols: each
process repeatedly takes a snapshot of shared memory, appending the result to
its local state, and then writes its local state to shared memory. With these
simplifications of the model, Borowsky and Gafni show that protocol complexes
have a very regular form. This allows them to apply a variant of Sperner's
Lemma to show that, for some simplex of any protocol complex, each of the
k + 1 processes outputs a different value. Using the BG simulation technique,
described in Section 5, they extend the impossibility result from the wait-free
setting to one where the number of failures is bounded by k.

The impossibility proof by Saks and Zaharoglou [100], which uses point-set
topology, has a different flavour from the other results described in this
section. They use a simplified model similar to that of Borowsky and Gafni,
and consider the space of all (finite and infinite) schedules. Saks and Zaharoglou
define a topology on this set, where open sets are sets of schedules that can be
recognized (that is, a set is open if there is an algorithm where some process
eventually writes "accept" for exactly those runs that follow a schedule in the set).
Now, suppose a k-set consensus algorithm exists for k + 1 processes (where each
process has its name as its input). Then the set D_i of schedules in which some
process outputs the value i is an open set, and it does not contain any schedule
where process i does not take any steps. These facts can be used, together with
Sperner's Lemma, to show that there is one schedule contained in all of the sets
D_i. In this schedule, the processes output k + 1 different values, which
contradicts the correctness of the algorithm.
An interesting direction for future research is to investigate the structure of this 
topological space of schedules. Perhaps theorems from point-set topology could 
then be applied to prove other results in distributed computing. 



7.3 The Asynchronous Computability Theorem

The third paper that proved impossibility of k-set consensus has since been
developed into a more general result that characterizes the tasks that can be solved
in a wait-free manner using registers. Herlihy and Shavit [63] proved that a 
task is solvable if and only if it is possible to subdivide the simplexes of the input 
complex into smaller simplexes (with any newly created vertices being appro- 
priately labelled by processes) that can then be mapped to the output complex. 
This mapping μ must satisfy properties similar to those of a decision map δ. It
must preserve the process labels on the vertices, map simplexes to simplexes,
and respect the task specification: if a simplex I of the input complex is
subdivided into smaller simplexes, the smaller ones must all be mapped to
simplexes in Δ(I). This characterization is called the Asynchronous Computability
Theorem. It reduces the question of whether a task is solvable to a question 
about properties of the complexes defined by the task specification. A key step 
in proving the necessity of the condition is a valency argument that shows the 
protocol complexes in this model contain no holes. To prove the impossibility 
of set consensus, they show (using Sperner's Lemma) that no mapping μ can
exist for the set consensus task. The paper also gives results on the impossibility 
of renaming, a problem where all processes must choose distinct values from a 
small set. 

Gafni and Koutsoupias [49] used the Asynchronous Computability Theorem 
to show that it is undecidable whether a given task has a wait-free solution using 
registers, even for finite tasks with three processes. They use a reduction from
a problem known to be undecidable: loop contractibility, where one must decide
whether or not a loop on a 2-dimensional simplicial complex can be contracted,
like an elastic band, to a point while staying on the surface of the complex.
Suppose the input complex (for three processes) is simply a 2-simplex. Given
a loop on a 2-dimensional output complex, they define a task that requires the
boundary of the input simplex to be mapped to the loop by the function μ
of the Asynchronous Computability Theorem. This map μ can be extended to
the whole (subdivided) input simplex if and only if the loop can be contracted. 
Herlihy and Rajsbaum [58] extended this undecidability result to other models. 

Havlicek [52] used the Asynchronous Computability Theorem to identify a 
condition that is necessary for a task to have a wait-free solution using registers. 
This condition is computable for finite tasks. 

Several researchers have presented characterizations similar to the Asynchronous
Computability Theorem. These alternative views give further insight
into the models and the proof techniques are quite different in some cases. Her- 
lihy and Rajsbaum [56] showed how to prove impossibility results in distributed 
computing using powerful ideas developed in the homology theory of simpli- 
cial complexes. They discussed models where the shared memory consists of 
registers, or of registers and set consensus objects. They reproved impos- 
sibility results for the set consensus problem, and gave some new results for the 
renaming problem. Attiya and Rajsbaum [15] used purely combinatorial argu- 
ments to develop a characterization of tasks solvable using registers, similar 
to the Asynchronous Computability Theorem. In particular, they showed that 
the protocol complexes for a simplified model have a very regular form. Both of 
these papers eliminated the need to subdivide the input complex by introducing 
maps that take simplexes of the input complex directly to subcomplexes of the
output complex. 

Borowsky and Gafni [23] gave an elegant proof of a version of the Asyn- 
chronous Computability Theorem without using topological arguments. They 
introduced the iterated immediate snapshot model and proved that it is capable
of solving the same set of tasks as the ordinary register model. They proved the
equivalence of the models by giving algorithms to simulate one model in the other 
[22,23]. The protocol complex of a (full-information) protocol in their simplified 
model is a well-understood subdivision of the input complex. Thus, a problem is 
solvable in either model if and only if there is a decision map from a subdivision 
of this form to the output complex that respects the task specification. 
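
To give a feeling for the regular form of these protocol complexes, the sketch below enumerates one round of immediate snapshot for three processes. It relies on the standard description of that model, which is not spelled out in the text above and is therefore an assumption here: an execution is an ordered partition of the processes into blocks, the processes of a block write together and then snapshot together, so each process sees exactly the processes in its own block and in all earlier blocks. The function names are hypothetical.

```python
from itertools import combinations

def ordered_partitions(items):
    """Yield every ordered partition of `items` into non-empty blocks."""
    items = list(items)
    if not items:
        yield []
        return
    for size in range(1, len(items) + 1):
        for first in combinations(items, size):
            rest = [x for x in items if x not in first]
            for tail in ordered_partitions(rest):
                yield [frozenset(first)] + tail

def immediate_snapshot_facets(processes):
    """One facet per execution; a vertex is (process, set of processes it saw)."""
    facets = set()
    for blocks in ordered_partitions(processes):
        seen = frozenset()
        facet = set()
        for block in blocks:
            seen = seen | block          # a block sees itself and earlier blocks
            for p in block:
                facet.add((p, seen))
        facets.add(frozenset(facet))
    return facets

facets = immediate_snapshot_facets("PQR")
vertices = {v for f in facets for v in f}
print(len(facets), len(vertices))  # 13 12
```

Starting from a single input triangle, one round thus yields 13 triangles on 12 vertices, which is the subdivision of the triangle usually called the standard chromatic subdivision. Iterating the round subdivides each of these triangles again in the same way.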

7.4 Other models

Herlihy and Rajsbaum [55] undertook a detailed investigation of the topology 
of set consensus. They gave conditions about the connectivity of protocol com- 
plexes that are necessary for the solution of set consensus. They also described 
connectivity properties of the protocol complexes in a model where the primitive 
objects are set consensus objects and registers. They later used this work 
to give a computable characterization of tasks that can be solved t-resiliently
in various models that allow processes to access consensus or set consensus
objects [58]. The characterization uses topological tools but also builds on the
characterization given by Biran, Moran and Zaks (see Section 6.3) for systems
using only registers. 

Herlihy and Rajsbaum [59] also considered an interesting class of decision 
tasks, called loop agreement tasks [58]. Using topological properties of the output
complexes, they describe when one loop agreement task can be solved using
registers and a single copy of an object that solves another loop agreement 
task. 

Herlihy, Rajsbaum and Tuttle [61] used the topological approach to give unified
proofs of lower bounds for set consensus in several message-passing models
with varying degrees of synchrony. There has been other work presenting
impossibility results for several different models in a unified way [48,85,91].



7.5 Directions for future research

A desirable goal is a better understanding of the structure of protocol complexes 
for different models. The complexes tend to be quite complicated, but to obtain
impossibility results, it is often sufficient to prove that they have certain
properties, without fully describing their form. Restrictions on the adversarial
scheduler can also simplify the structure of protocol complexes, making them
easier to study while simultaneously strengthening any lower bounds obtained.
Most of the research has focused on one-shot objects or tasks; the extension of these
techniques to long-lived objects is a subject of current research.



8 Robustness 

The consensus number of an object type provides information about the power of 
a system that has objects of that type and registers. However, the classification
of individual types into the consensus hierarchy does not necessarily provide 
complete information about the power of a system that contains several different 
types of objects: it is possible that a collection of weak types can become strong 
when used in combination. This issue was first addressed by Jayanti [66,68]
in his papers on robustness of the consensus hierarchy. The hierarchy is robust
(with respect to a class of object types) if it is impossible to obtain an n-process 
implementation of a type at level n of the hierarchy from a finite set of types 
that are each at lower levels. Robustness is a desirable property since it allows 
one to study the synchronization power of a system equipped with several types 
by reasoning about the power of each of the types individually. 

8.1 Non-Robustness Results

A variety of non-robustness results have been proved during the past decade. 
Typically, one defines a pair of objects that are tailor-made to work together
to solve consensus easily. To complete the proof, one must show that each of
the types, when used by itself, cannot solve consensus. These impossibility
results often required ingenious lower bound techniques.

Some of the early results on the robustness question showed that the hierarchy
is not robust under slightly different definitions of consensus number
[40,66,68,75]. These results are, in part, responsible for the choice of the now-standard
definition of consensus number. There have also been a number of non-robustness
results when the response that objects return can depend on the identity of the
process that invoked the operation [31,32,88,94].

One of Jayanti's proofs [66] used an interesting simulation technique. He defined
a simple type, called weak-sticky, with infinite consensus number. He gave
an implementation of weak-sticky from registers that is not wait-free but has
the property that at most one operation on the object will fail to terminate. He
used this to show that there is no consensus algorithm for three processes that
uses a single weak-sticky object and registers; if there were, one could use
the implementation to obtain a 1-resilient consensus algorithm for three processes
using only registers, which is impossible [84]. "Imperfect" implementations
have been used to prove results elsewhere (see Section 5).

Schenk [102,103] proved that the consensus hierarchy is not robust by considering
a type with unbounded non-determinism, i.e. an operation may cause
an object to choose non-deterministically from an infinite number of possible
state transitions. In this case, he said an algorithm is wait-free if the number of
steps taken by a process must be bounded, where the bound may depend on the
input to the protocol. For objects with bounded non-determinism, this definition
of wait-freedom is equivalent to the requirement that every execution is finite. 
Lo and Hadzilacos [83] improved Schenk's result by showing that the hierarchy 
is not robust even when restricted to objects with bounded non-determinism. 

Schenk defined two types, called lock and key. The key object is a simple
non-deterministic object that can easily be used to solve the weak agreement
problem: all processes must agree on a common output value and, if all processes
have the same input value, the output value must differ from it. He used a
counting argument to show that, for any consensus algorithm using keys and
registers, there exists a fixed output value for each key which is consistent
with every execution. This allows all the key objects to be eliminated, which is
impossible, unless cons({key}) = 1.

The lock object was specially constructed to provide processes with a solution
to the consensus problem if and only if processes can "convince" the object
that they can solve weak agreement. The lock object non-deterministically
chooses an instance of the weak agreement problem and gives this instance to
the processes as a challenge. It then reveals the solution to the original consensus
problem if and only if processes provide the lock object with a correct solution
to the challenge. (The idea of defining an object that only provides useful
results to operations when it is accessed properly, in combination with another
type of object, was originated by Jayanti [68] and is common to many of the
non-robustness proofs.) If processes have access to both a lock and key object,
they can use the key to solve the lock's challenge and unlock the solution to
consensus. Schenk used a type of valency argument developed by Lo [82] to
show that weak agreement and, hence, consensus for two processes cannot be
solved using only locks and registers. 

8.2 Robustness Results

Although the consensus hierarchy is not robust in general, the practical
importance of the non-robustness results is unclear, since the objects used in the
proofs are rather unusual. The hierarchy has been shown to be robust for some 
classes of objects that include many of the objects commonly considered in the 
literature. 

Ruppert [98] showed that the hierarchy is robust for the class of all RMW
and readable types. The proof uses the characterization of the types that can
be used to solve consensus for n processes, described above in Section 6.2. It
is easy to show, using a valency argument, that any consensus algorithm for n
processes built from such objects must include an object whose type satisfies the
conditions of the characterization. Therefore, n-process consensus can be solved
using only that type and registers. 

Recently, Herlihy and Ruppert [62] used the topological approach to characterize
the one-shot types that can be combined with other types to solve
consensus. It follows from their characterization that the class of deterministic 
one-shot types is robust. The key tool in one direction of the proof is a simulation 
technique that builds on the BG simulation (see Section 5). The other direction
is a generalization of the non-robustness results [83,103] described above.

8.3 Directions for Future Research

In proving the robustness result for readable and RMW types, two important
properties are used: such objects are deterministic and their state information 
can be accessed in some simple way by each process. The robustness result 
for one-shot types uses a similar property: when accessing a one-shot object, a
process gets all of the state information that it will ever be able to obtain directly 
from the object by doing a single operation. Can robustness results be extended 
to other natural classes of types that do not have these kinds of properties? By 
finding the line that separates those types that can be combined with others 
to violate robustness from those that cannot, we gain insight into the way that
types behave when used in complex systems. Work on the robustness question
has produced a number of interesting proof techniques, and has required very
careful definitions to avoid using "obviously true" properties which are not.
Clarifying definitions is one of the important contributions of lower bounds. 

9 Complexity Lower Bounds 

Once we know that a particular problem is solvable in a certain distributed 
system, we would like to have algorithms that solve these problems as efficiently
as possible, for example, using shared objects that are as small as possible,
using as few shared objects as possible, and using as little time as possible. In
this section, we discuss lower bounds on the resources needed to solve various
problems. In some cases, these bounds show that certain algorithms are optimal
or close to optimal. They also help us to understand the inherent difficulty of 
these problems. 

9.1 Lower Bounds on Space

Burns, Jackson, Lynch, Fischer, and Peterson [26] considered deterministic
solutions to the mutual exclusion problem using one shared object. In this problem,
processes repeatedly compete for exclusive access to a critical section, where
they are allowed to use a shared resource. A counting argument was used to
show that, if the object has an insufficient number of states, there are two
configurations which will appear identical to a group of processes: one in which no
processes are in the critical section and one in which some other process is in
the critical section. If only the processes in this group are scheduled, they will
behave the same way starting from both of these configurations, resulting in an
incorrect execution in one of the two cases (with either no process ever entering
the critical section or more than one process in the critical section at the same
time). For randomized computation, Kushilevitz, Mansour, Rabin, and Zuckerman
[77] obtained lower bounds on the size of the shared object based on an
analysis of Markov chains. 

Burns and Lynch [27,87] introduced the following technique to prove that 
any mutual exclusion algorithm for n > 2 processes that communicate using 
registers uses at least n registers, no matter how large the registers are. 
If an algorithm uses an insufficient number of objects, one can construct an 
execution that exhibits incorrect behaviour by combining pairs of different exe- 
cutions that look the same to a group of processes. An adversary scheduler runs 
the algorithm until there are processes covering every object that the algorithm 
uses. (A process covers an object if it will write to it when next allocated a 
step by the scheduler.) The effects of subsequent steps by other processes can be 
hidden by later performing these writes. Their lower bound is optimal, matching 
the number of registers used by known mutual exclusion algorithms [78,79]. 

Moran, Taubenfeld and Yadin [89] used the same approach to prove that any 
wait-free implementation of a mod m counter for n processes from objects with 
only 2 states must use at least min(^^, objects. 

Fich, Herlihy, and Shavit [45] considered a very weak termination condition, 
non-deterministic solo termination: at any point, if all but one process fails,
there is an execution in which the remaining process terminates. In particular,
wait-free and randomized wait-free algorithms satisfy non-deterministic solo
termination. They proved that Ω(√n) registers are needed by any asynchronous
algorithm for n-process consensus that satisfies this property. The proof uses the 
covering technique together with a valency argument, showing that from any 
multivalent configuration, there is another multivalent configuration in which 
more registers are covered. Because consensus is a decision task, the proof is 
more difficult than for mutual exclusion, where processes can repeatedly request
exclusive access to the critical section, or for implementing a counter that
processes can repeatedly increment. They overcome this problem by a new method
of cutting and combining executions. Although there are algorithms for randomized
wait-free consensus among n processes that use O(n) registers of bounded
size [9], it remains open whether this is optimal.

Fich, Herlihy, and Shavit extended their result to algorithms using historyless
objects. An object is historyless if its state depends only on the last non-trivial
operation that was applied to it. Some examples of historyless types are register,
swap, and test&set. Using this extension, they showed that Ω(√n) historyless
objects are necessary for randomized wait-free implementations of objects such
as compare&swap, fetch&increment, and bounded counters, in an n-process
system. Jayanti, Tan and Toueg [71] improved these bounds, showing that n - 1
historyless or resettable consensus objects are necessary for randomized wait-free
implementations of these objects. Much work remains to be done to obtain
space complexity lower bounds for other problems and in models with more
powerful objects. Attiya, Gorbach and Moran [13] used similar techniques in a
fault-free model, where processes have no identifiers, run identical programmes,
and communicate via registers. They showed that Ω(log n) shared registers
and Ω(log n) rounds are required for n processes to solve consensus.

9.2 The Complexity of Universal Constructions

Herlihy's universality result, discussed in Section 3, and subsequent similar
papers [1,33,54,73], provide universal constructions, which automatically give a
distributed implementation of any object type, using sufficiently powerful
shared-memory primitives. Jayanti [67,69] has studied some of the limitations of this
approach to providing implementations. He showed that a process that performs
a wait-free simulation of an operation using a universal construction requires
Ω(n) steps of local computation in the worst case, where n is the number of
processes [67]. This bound does not depend on the nature of the communication
between processes, and even holds in an amortized setting. The key idea in the
proof is the design of an object type that conspires with the scheduler to reveal
as little information about the behaviour of the object as possible. This ensures
that each process, simulating a single operation op, must do some computation
for each simulated operation that precedes op. The bound is tight [53].

Jayanti [69] also proved a lower bound of Ω(log n) on the number of shared-memory
operations that must be performed by a universal construction in the
worst case. This bound applies to a shared-memory model that has quite powerful
primitive types of shared objects and holds (for expected complexity) even if
randomization is permitted. Jayanti proved the bound by considering the wakeup
problem, where some process must detect when all processes have begun taking
steps, and studying how information propagates through the system. Roughly
speaking, each shared-memory operation at most doubles the size of the set
of processes that are known (by some process or memory location) to have woken
up. This lower bound is also tight [1].
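
The doubling intuition can be written as a short calculation. This is a simplified sketch of the counting step only, under the assumption that initially each process knows only of its own wakeup; it ignores randomization and the details of the model. The symbols K_t and T are introduced here for illustration: K_t is the largest number of processes known, by any single process or memory location, to have woken up after t shared-memory operations, and T is the number of operations performed by the time some process has detected that all n processes are awake.

```latex
\begin{align*}
K_0 &= 1, \qquad K_{t+1} \le 2K_t
  \quad\Longrightarrow\quad K_t \le 2^t, \\
K_T &\ge n
  \quad\Longrightarrow\quad 2^T \ge n
  \quad\Longrightarrow\quad T \ge \log_2 n .
\end{align*}
```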

9.3 Lower Bounds on Time

To understand the relative power of different models, it is important to obtain
separation results by proving a lower bound for a problem in one model that is 
larger than its complexity in another model. 

One problem that has been used to obtain a separation result is approximate
agreement, a variant of agreement that can be solved in asynchronous systems.
In this problem, each process receives a real input value in the range [0,1]. The
output values must lie within the range of the input values and must differ from
one another by no more than some given parameter ε > 0. Attiya, Lynch, and
Shavit [14] proved that any wait-free approximate agreement algorithm for n
processes and ε = 1/2, using single-writer registers (of unbounded size),
has a failure-free execution in which no process decides the value of its output 
before round log n. They do this by obtaining an upper bound on the number of
processes that can influence the state of a particular process during the first t 
rounds of a round-robin execution. Then they show that if a process P is not 
influenced by another process P', it cannot decide; otherwise, there is another
execution which is indistinguishable to P in which P' runs to completion before
the other processes begin and outputs an incompatible value. 
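
The two requirements of approximate agreement stated above are easy to express as a checker. The sketch below is only an illustration of the task specification (the function name is hypothetical); it says nothing about how an algorithm reaches such outputs.

```python
def is_valid_approximate_agreement(inputs, outputs, eps):
    """Check the two conditions of the task: every output lies within the
    range of the inputs, and any two outputs differ by at most eps."""
    lo, hi = min(inputs), max(inputs)
    within_range = all(lo <= y <= hi for y in outputs)
    close_together = max(outputs) - min(outputs) <= eps
    return within_range and close_together

print(is_valid_approximate_agreement([0.0, 1.0, 0.5], [0.5, 0.5, 0.75], 0.25))  # True
print(is_valid_approximate_agreement([0.0, 1.0], [0.0, 1.0], 0.5))              # False
```

In the lower bound argument, an "incompatible" output is one that would make this check fail in some execution the deciding process cannot distinguish from the current one.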

Schenk [101] proved that any wait-free approximate agreement algorithm 
for n processes that uses b-bit registers must take Ω(log(1/ε)/b) rounds and
use Ω(log(1/ε)/b) registers. The proof considers the amount of information a
process needs to determine its output value after all the other processes have 
decided. 

Schenk also gave an algorithm using 1-bit registers that matches these
lower bounds. Together with the single-writer register lower bound, this
implies that any wait-free implementation of multi-writer registers from single-writer
registers has round complexity Ω(log n). However, there is a large gap between
this lower bound and the best known implementation. It also remains open 
whether approximate agreement can be solved faster using larger registers. 

Hoest and Shavit [65] used topological techniques to determine the time com- 
plexity of approximate agreement in a generalization of Borowsky and Gafni's 
iterated immediate snapshot model. Essentially, they related the time complexity
of the task to the degree to which the input complex must be subdivided before
one can map it to the output complex (see Section 7.3). Although, in terms of
computability, their model is equivalent to the standard asynchronous model
containing only registers, their complexity results do not carry over. Much
work remains to find additional ways of applying topology to prove complexity
lower bounds. 

Sometimes, the choice of problem to use for a separation result comes from
identifying the essential part of a simulation. One such example is the write-all
problem: given n registers, all initially 0, set them all to 1. It has been used
as the basis of simulations of synchronous algorithms on asynchronous
shared-memory models. Buss, Kanellakis, Ragde, and Shvartsman [28] proved that any
asynchronous algorithm for this problem that uses n processes, at most half
of which can fail, must perform Ω(n log n) writes to the bits of the array. Their
proof holds in a very strong model, where processes can flip coins and, in a single
step, read the entire contents of shared memory. The idea is that an adversary
schedules the processes to run until each covers a bit of the array. Among the
bits with value 0, the adversary chooses the half which have the fewest processes
covering them and schedules the n/2 or more processes which cover other bits to
perform their writes. This can be repeated log_2 n times, each time reducing the
number of 0 bits in the array by at most a factor of 2. They provide a matching
upper bound when processes can perform atomic snapshots. For any ε > 0,
there is a deterministic algorithm using only registers that performs O(n^(1+ε))
operations, in total [7, 92]. It is an open question whether there is an algorithm
for the write-all problem that uses only registers and performs n(log n)^O(1)
total operations.
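
The halving adversary described above can be illustrated with a small
simulation. This is a sketch under strong simplifying assumptions, not the
proof of [28]: the algorithm is replaced by a hypothetical `cover` function
that says which zero bit each process is about to write, processes held back
by the adversary simply do not write in that round, and coin flips are ignored.

```python
import math

def run_adversary(n, cover):
    """Simulate the halving adversary on an array of n zero bits.

    `cover(zeros, n)` is a hypothetical stand-in for the algorithm: it returns
    a list giving, for each of the n processes, the zero bit it covers.
    Returns (rounds, writes).
    """
    zeros = set(range(n))
    rounds = writes = 0
    while zeros:
        target = cover(sorted(zeros), n)
        load = {b: 0 for b in zeros}
        for b in target:
            load[b] += 1
        # Protect the half of the zero bits with the fewest covering processes.
        ranked = sorted(zeros, key=lambda b: (load[b], b))
        protected = set(ranked[:len(zeros) // 2])
        # Only processes covering unprotected bits are scheduled to write.
        writers = [b for b in target if b not in protected]
        assert len(writers) >= n / 2  # at most half the processes are held back
        writes += len(writers)
        zeros -= set(writers)
        rounds += 1
    return rounds, writes

def spread_evenly(zeros, n):
    """One possible covering strategy: process k covers zero bit k mod len(zeros)."""
    return [zeros[k % len(zeros)] for k in range(n)]
```

With n = 16 and the `spread_evenly` strategy, the number of zero bits falls
16, 8, 4, 2, 1, 0, and at least n/2 writes are performed in every round, which
is the source of the n log n term.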

9.4 Lower Bounds on Time for Randomized Computation

Adding randomness to a model can increase its computational power. This can
make proving lower bounds in randomized models more difficult. Randomized
consensus is a variant of consensus with a slightly weaker termination condition:
all non-faulty processes must terminate within a finite expected number of steps.
In contrast to consensus, randomized consensus can be solved in an asynchronous
distributed message-passing system or in an asynchronous read-write
shared-memory system (i.e. using only registers). For example, there are wait-free
shared-memory algorithms for randomized consensus among n processes where
the expected total number of operations performed is O(n² log n) [25] and where
the expected number of operations performed by each process is O(n log² n) [10].
Most algorithms for randomized consensus are based on collective coin flipping,
which is a way of combining many local coin flips into a single global coin flip.
However, there is a complication: a malicious adversary can destroy some of the
local coins after they are tossed but before they are used. The goal of a collective
coin flip algorithm is to limit the degree to which the adversary can influence
the outcome of the global coin flip.
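
As a toy illustration of the complication just described, the sketch below
combines local coins by majority vote and asks whether an adversary that may
destroy at most t coins can force a chosen outcome. Majority voting, the
tie-breaking rule, and both function names are assumptions made for this
example; the cited algorithms are more involved.

```python
def global_coin(local_coins):
    """Majority vote over local coin flips (1 = heads, 0 = tails); ties give 0."""
    return 1 if 2 * sum(local_coins) > len(local_coins) else 0

def adversary_can_force(local_coins, t, target):
    """Can destroying at most t already-tossed coins make the result `target`?"""
    wanted = [c for c in local_coins if c == target]
    unwanted = [c for c in local_coins if c != target]
    # The adversary's best move is to destroy coins showing the other value.
    survivors = wanted + unwanted[min(t, len(unwanted)):]
    return global_coin(survivors) == target
```

A narrow majority such as three heads against two tails can be overturned by
destroying a single coin, whereas eight heads against two tails survives the
loss of two coins. This is why many local flips are needed to keep the
adversary's influence small.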

Aspnes proved that any t-resilient algorithm for randomized consensus on an
asynchronous message-passing or read-write shared-memory system performs
Ω(t²/log² t) local coin flips (and, hence, work) with high probability [8]. This
result, in fact, applies to all models that can be deterministically simulated by
read-write shared memory, including models that have counters or constant
time atomic snapshot primitives. The proof of his lower bound has two parts.
One is a lower bound on the number of local coin flips needed to prevent an
adversary from having too much influence on the outcome of a collective coin flip.
The other is an extension of the valency argument to the randomized setting to
show that an algorithm either performs a collective coin flip with small bias or
spends lots of local coin flips to avoid doing so. Aspnes introduces the notion of
an α-univalent configuration, a configuration from which an adversary scheduler
can cause the algorithm to produce the output value α with sufficiently high
probability. Then a bivalent configuration is both 0-univalent and 1-univalent
and a nullvalent configuration is neither. He shows that, with high probability,
an adversary scheduler can force any algorithm into a bivalent or nullvalent
configuration from its initial configuration or whenever a local coin flip is performed.
He also proves that a bivalent configuration always leads to a nullvalent
configuration or to a configuration in which a local coin flip can be scheduled next.
Finally, in nullvalent configurations, he shows that the coin flipping lower bound
applies. A polylogarithmic gap remains between the upper and lower bounds for
the amount of work to solve randomized consensus on asynchronous models.

Bar-Joseph and Ben-Or [17] extended Aspnes' result to synchronous
message-passing systems, obtaining a lower bound of Ω(t/√(n log n)) rounds (with high
probability) for t-resilient randomized consensus among n processes. They also
gave a matching upper bound in this model. In contrast, for deterministic
algorithms, t + 1 rounds are needed [43]. If the power of the adversary scheduler
is restricted so that its choices can only depend on the actions of the processes
(and cannot depend directly on the outcome of coin flips), then faster algorithms
are possible: there is a randomized consensus algorithm using registers with
O(log² n) expected running time per process [30]. Against such non-adaptive
adversaries, even expected constant time algorithms have been obtained [37,
44, 95]. Byzantine agreement, where faulty processes can behave maliciously, is
more difficult than consensus. However, no bigger lower bounds are known for
randomized Byzantine agreement than for randomized consensus.



10 Conclusions 

Why are lower bounds important for distributed computing? They help us to
better understand the nature of distributed computing: what makes certain
problems hard, what makes a model powerful, and how different models compare.
They tell us when to stop looking for better solutions or, at least, which
approaches will not work. If we have a problem that we need to solve, despite a
lower bound, the lower bound may indicate ways to adjust the problem specifica- 
tion or the modelling of the environment to allow reasonable solutions. Finally, 
trying to prove lower bounds can suggest new and different algorithms, especially 
when attempts to prove the bounds fail. 

This survey has presented many lower bound results and different techniques 
for proving them. We hope it will encourage you to try to prove lower bounds 
for the distributed computing problems you encounter. 

If someone says "can't," that shows you what to do [29].



Abstract. We present the first adaptive algorithm for N-process mutual
exclusion under read/write atomicity in which all busy waiting is by local
spinning. In our algorithm, each process p performs O(min(k, log N))
remote memory references to enter and exit its critical section, where k is
the maximum “point contention” experienced by p. The space complexity
of our algorithm is Θ(N), which is clearly optimal.



1 Introduction 

In this paper, we consider adaptive solutions to the mutual exclusion problem
[7] under read/write atomicity. A mutual exclusion algorithm is adaptive if its
time complexity is a function of the number of contending processes [6, 11, 13].
Two notions of contention have been considered in the literature: “interval
contention” and “point contention” [1]. These two notions are defined with respect
to a history H. The interval contention over H is the number of processes that
are active in H, i.e., that execute outside of their noncritical sections in H. The
point contention over H is the maximum number of processes that are active at
the same state in H. Note that point contention is always at most interval
contention. In this paper, we consider only point contention. Throughout the paper,
we let N denote the number of processes in the system, and we let k denote the
point contention experienced by an arbitrary process over a history that starts
when it becomes active and ends when it once again becomes inactive.
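
To make the difference between the two notions concrete, the following
sketch computes both from a history. The encoding of a history as a list of
(process, start, end) triples, with the process active during the half-open
interval [start, end), is an assumption made for this example only.

```python
def interval_contention(history):
    """Number of distinct processes that are active at some point in the history."""
    return len({p for p, _, _ in history})

def point_contention(history):
    """Maximum number of processes that are active at the same time."""
    events = []
    for _, start, end in history:
        events.append((start, 1))
        events.append((end, -1))
    active = best = 0
    # Sorting places (t, -1) before (t, 1), so a process that becomes inactive
    # at time t is not counted together with one that becomes active at t.
    for _, delta in sorted(events):
        active += delta
        best = max(best, active)
    return best
```

For a history in which a is active during [0, 2), b during [1, 3), and c
during [3, 5), the interval contention is 3 but the point contention is only 2.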

In previous work on adaptive mutual exclusion algorithms, two time
complexity measures have been considered: “remote step complexity” and “system
response time.” The remote step complexity of an algorithm is the maximum
number of shared-memory operations required by a process to enter and then
exit its critical section, assuming that each “await” statement is counted as one
operation [13]. The system response time is the length of time between critical
section entries, assuming each enabled read or write operation is executed within
some constant time bound [6]. Several read/write mutual exclusion algorithms
have been presented that are adaptive to some degree under these time
complexity measures. One of the first such algorithms was an algorithm of Styer
that has O(min(N, k log N)) remote step complexity and O(min(N, k log N))
response time [13]. Choy and Singh later improved upon Styer’s result by
presenting an algorithm with O(N) remote step complexity and O(k) response time
[6]. More recently, Attiya and Bortnikov presented an algorithm with O(k)
remote step complexity and O(log k) response time [5].

Recent work on scalable local-spin mutual exclusion algorithms has shown
that the most crucial factor in determining an algorithm’s performance is the
amount of interconnect traffic it generates [4, 8, 10, 14]. In light of this, we
define the time complexity of a mutual exclusion algorithm to be the worst-case
number of remote memory references by one process in order to enter and
then exit its critical section. A remote memory reference is a shared variable
access that requires an interconnect traversal. In local-spin algorithms, all
busy-waiting loops are required to be read-only loops that access only locally
accessible shared variables, which do not require an interconnect traversal. On a
distributed shared-memory multiprocessor, a shared variable is locally accessible
if it is stored in a local memory module. On a cache-coherent multiprocessor, a
shared variable is locally accessible if it is stored in a local cache line.

The first local-spin algorithms were algorithms in which read-modify-write
instructions are used to enqueue blocked processes onto the end of a “spin queue”
[4, 8, 10]. Each of these algorithms has O(1) time complexity; thus, adaptivity
is clearly a non-issue if appropriate read-modify-write instructions are
available. Yang and Anderson were the first to consider local-spin algorithms under
read/write atomicity [14]. They presented a read/write mutual exclusion
algorithm with O(log N) time complexity in which instances of a local-spin mutual
exclusion algorithm for two processes are embedded within a binary arbitration
tree. They also presented a “fast-path” variant of this algorithm that allows the
tree to be bypassed in the absence of contention. Although the contention-free
time complexity of this algorithm is O(1), its time complexity under contention
is O(N) in the worst case, rather than O(log N). In recent work, Anderson
and Kim presented a new fast-path mechanism that results in O(1) time
complexity in the absence of contention and O(log N) time complexity under
contention, when used with Yang and Anderson’s algorithm [3].

None of the previously cited adaptive algorithms is a local-spin algorithm,
and thus they all have unbounded time complexity under the
remote-memory-references time measure. One could argue that for an algorithm to be considered
truly adaptive, it must be adaptive under this measure. After all, the underlying
hardware does not distinguish between remote memory references generated by
await statements and remote memory references generated by other statements.
Surprisingly, while adaptivity and local spinning have been the predominant
themes in recent work on mutual exclusion, the problem of designing an adaptive,
local-spin algorithm under read/write atomicity has remained open. In this
paper, we close this problem by presenting an algorithm that has O(min(k, log N))
time complexity under the remote-memory-references measure.

Our algorithm can be seen as an extension of the fast-path algorithm of
Anderson and Kim [3]. This algorithm was devised by thinking about connections
between fast-path mechanisms and long-lived renaming [12]. Long-lived
renaming algorithms are used to “shrink” the size of the name space from which process
identifiers are taken. The problem is to design operations that processes may
invoke in order to acquire new names from the reduced name space when they
are needed, and to release any previously-acquired name when it is no longer
needed. In Anderson and Kim’s algorithm, a particular name is associated with
the fast path; to take the fast path, a process must first acquire the fast-path
name. Our adaptive algorithm can be seen as a generalization of Anderson and
Kim’s fast-path mechanism in which every name is associated with some “path”
to the critical section. The length of the path taken by a process is determined
by the point contention that it experiences.



shared variables X: {⊥} ∪ {0..N − 1} init ⊥;
                 Y: boolean init false

private variable dir: {D, R, S}   /* down, right, stop */

1: X := p;
2: if Y then dir := R
   else
3:    Y := true;
4:    if X = p then dir := S
      else dir := D
      fi
   fi

Fig. 1. The splitter element and the code fragment that implements it.



2 Adaptive Algorithm 

In our adaptive algorithm, code sequences from several other algorithms are 
used. In Sec. 2.1, we present a review of these other algorithms and discuss 
some of the basic ideas underlying our algorithm. Then, in Sec. 2.2, we present 
a detailed description of our algorithm. 



2.1 Related Algorithms and Key Ideas 

At the heart of our algorithm is the splitter element from the grid-based
long-lived renaming algorithm of Moir and Anderson [12]. This splitter element was
actually first used in Lamport’s fast mutual exclusion algorithm [9]. The splitter
element is defined by the code fragment shown in Fig. 1. (In this and subsequent
figures, we assume that each labeled sequence of statements is atomic; in each
figure, each labeled sequence reads or writes at most one shared variable.) Each
process that invokes this code fragment either stops, moves down, or moves right
(the move is defined by the value assigned to the variable dir). One of the key
properties of the splitter that makes it so useful is the following: if n processes
invoke a splitter, then at most one of them can stop at that splitter, at most
n − 1 can move right, and at most n − 1 can move down.



Fig. 2. (a) Renaming grid (depicted for N = 5). (b) Renaming tree.
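
The splitter property stated above can be checked mechanically for small n.
The sketch below is an illustration, not part of the paper: it transcribes
the four labeled statements of Fig. 1 as atomic steps and explores every
interleaving of n processes exhaustively. The function name and the state
encoding are assumptions.

```python
def splitter_outcomes(n):
    """Return every tuple of directions reachable when n processes run Fig. 1.

    Each process has a program counter: 1..4 for the labeled statements and
    0 once it has chosen a direction ('S', 'R' or 'D').
    """
    outcomes, seen = set(), set()

    def explore(X, Y, pcs, dirs):
        state = (X, Y, pcs, dirs)
        if state in seen:
            return
        seen.add(state)
        if all(pc == 0 for pc in pcs):
            outcomes.add(dirs)
            return
        for p in range(n):
            pc = pcs[p]
            if pc == 0:
                continue
            nX, nY, npc, nd = X, Y, 0, dirs[p]
            if pc == 1:                 # 1: X := p
                nX, npc = p, 2
            elif pc == 2:               # 2: if Y then dir := R
                if Y:
                    nd = 'R'
                else:
                    npc = 3
            elif pc == 3:               # 3: Y := true
                nY, npc = True, 4
            else:                       # 4: if X = p then S else D
                nd = 'S' if X == p else 'D'
            explore(nX, nY,
                    pcs[:p] + (npc,) + pcs[p + 1:],
                    dirs[:p] + (nd,) + dirs[p + 1:])

    explore(None, False, (1,) * n, (None,) * n)
    return outcomes
```

Running this for n = 1, 2, 3 and counting the directions in each outcome
confirms that no interleaving lets two processes stop, and that the processes
never all move right or all move down.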



Because of these properties, it is possible to solve the renaming problem by 
interconnecting a collection of splitters in a grid as shown in Fig. 2(a). A name 
is associated with each splitter. If the grid has N rows and N columns, then by 
induction, every process eventually stops at some splitter. When a process stops 
at a splitter, it acquires the name associated with that splitter. In the long-lived 
renaming problem [12], processes must have the ability to release the names 
they acquire. In the grid algorithm, a process can release its name by resetting 
each splitter on the path traversed by it in acquiring its name. A splitter can be 
reset by resetting its Y variable to true. For the renaming mechanism to work 
correctly, it is important that a splitter be reset only if there are no processes 
“downstream” from it (i.e., in the sub-grid “rooted” at that splitter). In Moir
and Anderson’s algorithm, it takes O(N) time to determine whether there are
“downstream” processes. This is because each process checks every other process 
individually to determine if it is downstream from a splitter. As we shall see, a 
more efficient reset mechanism is needed for our adaptive algorithm. 

The main idea behind our algorithm is to let an arbitration tree form
dynamically within a structure similar to the renaming grid. This tree may not remain
balanced, but its height is proportional to contention. The job of integrating the
renaming aspects of the algorithm with the arbitration tree is greatly simplified
if we replace the grid by a binary tree of splitters as shown in Fig. 2(b). (Since
we are now working with a tree, we will henceforth refer to the directions
associated with a splitter as stop, left, and right.) Note that this results in many more
names than before. However, this is not a major concern, because we are really
not interested in minimizing the name space. The arbitration tree is defined by
associating a three-process mutual exclusion algorithm with each node in the
renaming tree. This three-process algorithm can be implemented in constant time
using the local-spin mutual exclusion algorithm of Yang and Anderson [14]. We
explain below why a three-process algorithm is needed instead of a two-process
algorithm (as one would expect to have in an arbitration tree).



Fig. 3. (a) Renaming tree and overflow tree. (b) Process p gets a name in the renaming
tree. (c) Process q fails to get a name and must compete within the overflow tree.

In our algorithm, a process p performs the following basic steps. (For the
moment, we are ignoring certain complexities that must be dealt with.) 

Step 1 p first acquires a new name by moving down from the root of the renaming 
tree, until it stops at some node. In the steps that follow, we refer to this 
node as p’s acquired node, p’s acquired node determines its starting point in 
the arbitration tree. 

Step 2 p then competes within the arbitration tree by executing each of the three- 
process entry sections on the path from its acquired node to the root. Note 
that a node’s entry section may be invoked by the process that stopped at 
that node, and one process from each of the left and right subtrees beneath 
that node. This is why a three-process algorithm is needed. 

Step 3 After competing within the arbitration tree, p executes its critical section. 

Step 4 Upon completing its critical section, p releases its acquired name by reopen- 
ing all of the splitters on the path from its acquired node to the root. 

Step 5 After releasing its name, p executes each of the three-process exit sections 
on the path from the root to its acquired node. 
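
Steps 2 and 5 traverse the same path in opposite directions. As a rough
illustration, the sketch below lists the nodes whose entry sections and exit
sections a process would execute, given the node at which it stopped. It
assumes a heap-style numbering of the tree (root 1, parent of node n is
n // 2); that numbering and the function name are choices made for this
example.

```python
def entry_and_exit_order(acquired_node):
    """Return (entry_nodes, exit_nodes) for a process that stopped at acquired_node.

    Entry sections run from the acquired node up to the root (Step 2);
    exit sections run from the root back down to the acquired node (Step 5).
    """
    entry_nodes = []
    n = acquired_node
    while n >= 1:
        entry_nodes.append(n)
        n //= 2  # move to the parent; the root's parent is 0, which ends the loop
    exit_nodes = list(reversed(entry_nodes))
    return entry_nodes, exit_nodes
```

A process that stops at node 5 would thus compete at nodes 5, 2, and 1 on
its way to the critical section and release them in the order 1, 2, 5.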

If we were to use a binary tree of height N, just as we previously had a
grid with N rows and N columns, then the total number of nodes in the tree
would be O(2^N). We circumvent this problem by defining the tree’s height to be
⌊log N⌋, which results in a tree with O(N) nodes. With this change, a process
could “fall off” the end of the tree without acquiring a name. However, this can
happen only if contention is Ω(log N). To handle processes that “fall off the
end,” we introduce a second arbitration tree, which is implemented using Yang
and Anderson’s local-spin arbitration-tree algorithm [14]. We refer to the two
trees used in our algorithm as the renaming tree and overflow tree, respectively.
These two trees are connected by placing a two-process version of Yang and
Anderson’s algorithm on top of each tree, as illustrated in Fig. 3(a). Fig. 3(b)
illustrates the steps that might be taken by a process p in acquiring a new name
if contention is O(log N). Fig. 3(c) illustrates the steps that might be taken by
a process q if contention is Ω(log N).

A major difficulty that we have ignored until this point is that of efficiently 
reopening a splitter, as described in Step 4 above. In Moir and Anderson’s
renaming algorithm, it takes O(N) time to reopen a splitter. To see why reopening
a splitter is difficult, consider again Fig. 1. If a process does succeed in stopping
at a splitter, then that process can reopen the splitter itself by simply assigning 
Y := true. On the other hand, if no process succeeds in stopping at a splitter, 
then some process that moved left or right from that splitter must reset it. Unfor- 
tunately, because processes are asynchronous and communicate only by means 
of atomic read and write operations, it can be difficult for a left- or right-moving 
process to know whether some process has stopped at a splitter. 

Anderson and Kim solved this problem in their fast-path mutual exclusion 
algorithm by exploiting the fact that much of the reset code can be executed 
within a process’s critical section [3]. Thus, the job of designing efficient reset
code is much easier here than when designing a wait-free long-lived renaming 
algorithm. As mentioned earlier, in Anderson and Kim’s fast-path algorithm, 
a particular name is associated with the fast path; to take the fast path, a 
process must first acquire the fast-path name. In our adaptive algorithm, we 
must efficiently manage acquisitions and releases for a set of names. 



2.2 Detailed Description 

Having introduced the major ideas that underlie our algorithm, we now present a 
detailed description of the algorithm and its properties. We do this in three steps. 
First, we consider a version of the algorithm in which unbounded memory is used 
to reset splitters in constant time. Second, we consider a variant of the algorithm 
with O(N²) space complexity in which all variables are bounded. Third, we
present another variant that has O(N) space complexity. In explaining these
algorithms, we actually present proof sketches for some of the key properties of 
each algorithm. Our intent is to use these proof sketches as a means for intuitively 
explaining the basic mechanisms of each algorithm. A formal correctness proof 
for the final algorithm is presented in the full version of this paper [2].

Algorithm U. The first algorithm, which we call Algorithm U (for unbounded),
is shown in Fig. 4. Before describing how this algorithm works, we first examine
its basic structure. At the top of Fig. 4, definitions of two constants are given:
D, which is the maximum level in the renaming tree (the root is at level 0), and
T, which gives the total number of nodes in the renaming tree. As mentioned
earlier, the renaming tree is comprised of a collection of splitters. These splitters
are indexed from 1 to T. If splitter i is not a leaf, then its left and right children
are splitters 2i and 2i + 1, respectively.
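
The two constants and the child indexing can be written out directly. The
sketch below is only an illustration of the definitions above; the function
names are assumptions, and `N.bit_length() - 1` is used as one way to compute
⌊log N⌋ (base 2) for a positive integer N.

```python
def renaming_tree(N):
    """Return (D, T) for N processes: D = floor(log2 N), T = 2**(D + 1) - 1."""
    D = N.bit_length() - 1  # floor(log2 N) for N >= 1
    T = 2 ** (D + 1) - 1    # number of nodes in a full binary tree with levels 0..D
    return D, T

def children(i, T):
    """Left and right children of splitter i, or None if splitter i is a leaf."""
    return (2 * i, 2 * i + 1) if 2 * i + 1 <= T else None
```

For N = 5 this gives D = 2 and T = 7, so splitter 3 has children 6 and 7 and
splitters 4 through 7 are leaves. Since 2^D ≤ N, the tree never has more than
2N − 1 nodes.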

Each splitter i is defined by four shared variables and an infinite shared array:
X[i], Y[i], Reset[i], Rnd[i] (the array), and Acquired[i]. Variables X[i] and Y[i]
are as in Fig. 1, with the exception that Y[i] now has an additional
integer rnd field. As explained below, Algorithm U works by associating “round
numbers” with the various rounds of competition for the name corresponding to
each splitter. In Algorithm U, these round numbers grow without bound. The
rnd field of Y[i] gives the current round number for splitter i. Reset[i] is used
to reinitialize the rnd field of Y[i] when name i is released. Rnd[i][r] is used to
identify a potential “winning” process that has succeeded in acquiring name i
in round r. Acquired[i] is set when some process acquires name i.



const
   D = ⌊log N⌋;          /* depth of renaming tree = O(log N) */
   T = 2^(D+1) − 1       /* size of renaming tree = O(N) */

type
   Ytype = record free: boolean; rnd: 0..∞ end;   /* stored in one word */
   Dtype = {L, R, S};                             /* splitter moves (left, right, stop) */
   Ptype = record nd: 1..2T + 1; dir: Dtype end   /* path information */

shared variables
   X: array[1..T] of 0..N − 1;
   Y, Reset: array[1..T] of Ytype init (true, 0);
   Rnd: array[1..T][0..∞] of boolean init false;
   Acquired: array[1..T] of boolean init false

private variables
   nd, n: 1..2T + 1;
   lvl, j: 0..D + 1;
   y: Ytype; dir: Dtype;
   path: array[0..D] of Ptype

process p ::    /* 0 ≤ p < N */
while true do
0:   Noncritical Section;
1:   nd, lvl := 1, 0;

     /* descend renaming tree */
     repeat
2:      X[nd], dir := p, S;
3:      y := Y[nd];
        if ¬y.free then dir := R
        else
4:         Y[nd] := (false, 0);
5:         if X[nd] ≠ p ∨
6:            Acquired[nd] then
              dir := L
           else
7:            Rnd[nd][y.rnd] := true;
8:            if Reset[nd] ≠ y then
9:               Rnd[nd][y.rnd], dir := false, L
        fi fi fi;
10:     path[lvl] := (nd, dir);
        if dir ≠ S then
           lvl, nd := lvl + 1, 2 · nd;
           if dir = R then nd := nd + 1 fi
        fi
     until (lvl > D) ∨ (dir = S);

     if lvl ≤ D then          /* got a name */
11:     Acquired[nd] := true;
        for j := lvl downto 0 do
12:        ENTRY_3(path[j].nd, path[j].dir)
        od;
13:     ENTRY_2(0)
     else                     /* didn't get a name */
14:     ENTRY_N(p);
15:     ENTRY_2(1)
     fi;

16:  Critical Section;

     /* reset splitters */
     for j := min(lvl, D) downto 0 do
        if path[j].dir ≠ R then
17:        n := path[j].nd;
18:        y := Reset[n];
19:        Reset[n] := (false, y.rnd);
20:        if j = lvl ∨
              ¬Rnd[n][y.rnd] then
21:           Reset[n] := (true, y.rnd + 1);
22:           Y[n] := (true, y.rnd + 1)
           fi
        fi
     od;

     /* execute exit sections */
     if lvl ≤ D then
23:     EXIT_2(0);
        for j := 0 to lvl do
24:        EXIT_3(path[j].nd, path[j].dir)
        od;
25:     Acquired[nd] := false
     else
26:     EXIT_2(1);
27:     EXIT_N(p)
     fi
od

Fig. 4. Algorithm U: adaptive algorithm with unbounded memory.

Each process descends the renaming tree, starting at the root, until it either 
acquires a name or “falls off the end” of the tree, as discussed earlier. A process 
determines if it can acquire name i by executing statements 2-10 with nd = i.
Of these, statements 2-5 correspond to the splitter code in Fig. 1. Statements
6-9 are executed as part of a handshaking mechanism that prevents a process
that is releasing a name from adversely interfering with processes attempting to 
acquire that name; this mechanism is discussed in detail below. Statement 10 
simply prepares for the next iteration of the repeat loop (if there is one). 

If a process p succeeds in acquiring a name while descending within the 
renaming tree, then it competes within the renaming tree by moving up from 
its acquired name to the root, executing the three-process entry sections on this 
path (statements 11-12). Each of these three-process entry sections is denoted 
“ENTRY_3(n, d),” where n is the corresponding tree node, and d is the “identity” of
the invoking process. The “identity” that is used is simply the invoking process’s
direction out of node n (S, L, or R) when it descended the renaming tree. After
ascending the renaming tree, p invokes the two-process entry section “on top”
of the renaming and overflow trees (as illustrated in Fig. 3(a)) using “0” as a
process identifier (statement 13). This entry section is denoted “ENTRY_2(0).”

If a process p does not succeed in acquiring a name while descending within
the renaming tree, then it competes within the overflow tree (statement 14),
which is implemented using Yang and Anderson’s N-process arbitration-tree
algorithm. The entry section of this algorithm is denoted “ENTRY_N(p).” Note that p
uses its own process identifier in this algorithm. After competing within the
overflow tree, p executes the two-process algorithm “on top” of both trees using “1”
as a process identifier (statement 15). This entry section is denoted “ENTRY_2(1).”

After completing the appropriate two-process entry section, process p exe- 
cutes its critical section (statement 16). It then resets each of the splitters that it 
visited while descending the renaming tree (statements 17-22). This reset mech- 
anism is discussed in detail below. Process p then executes the exit sections 
corresponding to the entry sections it executed previously (statements 23-27). 
The exit sections are specified in a manner that is similar to the entry sections. 

We now consider in detail the code fragments that are executed to acquire 
(statements 2-10) or reset (statements 18-22) some splitter i. To facilitate this 
discussion, we will index these statements by i. For example, when we refer to 
the execution of statement 4[i] by process p, we mean the execution of statement 
4 by p when its private variable nd equals i. Similarly, 18[i] denotes the execution
of statement 18 with n = i. 

As explained above, one of the problems with the splitter code is that it
is difficult for a left- or right-moving process at splitter i to know which (if
any) process has acquired name i. In Algorithm U, this problem is solved by
viewing the computation involving each splitter as occurring in a sequence of
rounds. Each round ends when the splitter is reset. During a round, at most one
process succeeds in acquiring the name of the splitter. Note that it is possible
that no process acquires the name during a round. So that processes can know
the current round number at splitter i, an additional rnd field has been added
to Y[i]. This field will increase without bound over time, so we will never have
to worry about round numbers being reused.

With the added rnd field, a left- or right-moving process at splitter i has a
way of identifying a process that has acquired the name at splitter i. To see how
this works, consider what happens during round r at node i. Of the processes
that participate in round r at node i, at least one will read Y[i] = (true, r) at
statement 3[i] and assign Y[i] := (false, 0) at statement 4[i]. By the correctness
of the original splitter code, of the processes that assign Y[i], at most one will
reach statement 7[i]. A process that reaches statement 7[i] will either stop at
node i or be deflected left. This gives us two cases to analyze: of the processes
that read Y[i] = (true, r) at statement 3[i] and assign Y[i] at statement 4[i],
either all are deflected left, or one, say p, stops at splitter i.

In the former case, at least one of the left-moving processes finds Rnd[i][r] to
be false at statement 20[i], and then reopens splitter i by executing statements
21[i] and 22[i], which establish Y[i] = (true, r + 1) ∧ Y[i] = Reset[i]. To
see why at least one process executes statements 21[i] and 22[i], note that each
process under consideration reads Y[i] = (true, r) at statement 3[i], and thus its
y.rnd variable equals r while executing within statements 4[i]-9[i]. Note also that
Rnd[i][r] = true is established only by statement 7[i]. Moreover, each process
deflected left at statement 9[i] first assigns Rnd[i][r] := false. Thus, at least one
of the left-moving processes finds Rnd[i][r] to be false at statement 20[i].

In the case that there is a winning process p that stops at splitter i during
round r, we must argue that (i) p reopens splitter i upon leaving it, and (ii) no
left- or right-moving process “prematurely” reopens splitter i before p has left it.
Establishing (i) is straightforward. Process p will reopen the splitter by executing
statements 18[i]-22[i] and 25, which establish Y[i] = (true, r + 1) ∧ Acquired[i] =
false ∧ Y[i] = Reset[i]. Note that the assignment to Acquired at statement 25
prevents the reopening of splitter i from actually taking effect until after p has
finished executing its exit section.
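
The round mechanism for a single splitter can be traced with a small
sequential sketch. It transcribes statements 2-9 and 18-22 of Fig. 4 for one
splitter, but runs each group without interleaving, so it illustrates only
the bookkeeping of Y, Reset, Rnd, and Acquired, not the concurrent behavior
that the argument above is about. The class and function names are
assumptions.

```python
class Splitter:
    """State of one splitter i in Algorithm U (round numbers are unbounded)."""
    def __init__(self):
        self.X = None
        self.Y = (True, 0)      # (free, rnd)
        self.Reset = (True, 0)  # (free, rnd)
        self.Rnd = {}           # round number -> bool, missing entries mean false
        self.Acquired = False

def try_acquire(s, p):
    """Statements 2-9, run atomically as a whole; returns 'S', 'L' or 'R'."""
    s.X = p                          # 2
    y = s.Y                          # 3
    if not y[0]:
        return 'R'
    s.Y = (False, 0)                 # 4
    if s.X != p or s.Acquired:       # 5, 6
        return 'L'
    s.Rnd[y[1]] = True               # 7
    if s.Reset != y:                 # 8
        s.Rnd[y[1]] = False          # 9
        return 'L'
    return 'S'

def reset(s, stopped_here):
    """Statements 18-22; `stopped_here` plays the role of the test j = lvl."""
    y = s.Reset                                        # 18
    s.Reset = (False, y[1])                            # 19
    if stopped_here or not s.Rnd.get(y[1], False):     # 20
        s.Reset = (True, y[1] + 1)                     # 21
        s.Y = (True, y[1] + 1)                         # 22
```

In a run where process 0 stops at the splitter in round 0, a reset attempt by
another process leaves the splitter closed because Rnd[0] is true. Only the
reset by process 0 itself advances both Y and Reset to (true, 1), after which
a new process can stop in round 1.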

To establish (ii), suppose, to the contrary, that some left- or right-moving
process reopens splitter i by executing statement 22[i] while p is executing within
statements 10[i]-13 and 16-25. (Note that, because p stops at splitter i, it doesn’t
iterate again within the repeat loop.) Let q be the first left- or right-moving
process to execute statement 22[i]. Since we are assuming that the ENTRY and
EXIT calls are correct, q cannot execute statement 22[i] while p is executing
within statements 16-22. Moreover, if p is executing within statements 12-13 or
23-25, then Acquired[i] is true, and hence the splitter is closed. The remaining
possibility is that p is enabled to execute statement 10[i] or 11. (Note that, in
this case, if q were to reopen splitter i then we could end up with two processes
concurrently invoking ENTRY_3(i, S) at statement 12, i.e., both processes use S
as a “process identifier.” The ENTRY calls obviously cannot be assumed to work
correctly if such a scenario could happen.)

So, assume that q executes statement 22[i] while p is enabled to execute 
statement 10[i] or 11. For this to happen, q must have read Rnd[i][r] = false 
at statement 20[i] before p assigned Rnd[i][r] := true at statement 7[i]. (Recall 
that all the processes under consideration read Y[i] = (true, r) at statement 
2[i]. This is why p writes to Rnd[i][r] instead of some other element of Rnd[i]. q 
reads from Rnd[i][r] at statement 20[i] because it is the first process to attempt to 
reset splitter i, which implies that q reads Reset[i] = (true, r) at statement 18[i].) 
Because q executes statement 20[i] before p executes statement 7[i], statement 
19[i] is executed by q before statement 8[i] is executed by p. Thus, p must have 
found Reset[i] ≠ y at statement 7[i], i.e., it was deflected left at splitter i, which 
is a contradiction. It follows from the explanation given here that splitter i is 
eventually reset for round r + 1, i.e., we have the following property. 

Property 1: Let S be the set of all processes that read Y[i].rnd = r at statement 
3[i]. If S is nonempty, then Y[i] = (true, r + 1) ∧ Y[i] = Reset[i] ∧ Acquired[i] = 
false is eventually established, and at all states after it is first established, no 
process in set S stops at splitter i. □ 

Because the splitters are always reset properly, it follows that the ENTRY and 
EXIT routines are always invoked properly. If these routines are implemented 
using Yang and Anderson's local-spin algorithm, then since that algorithm is 
starvation-free, Algorithm U is as well. 

Having dispensed with basic correctness, we now informally argue that 
Algorithm U is contention sensitive. For a process p to descend to a splitter at 
level l in the renaming tree, it must have been deflected left or right at each 
prior splitter it accessed. Just as with the original grid-based long-lived 
renaming algorithm [12], this can only happen if the point contention experienced by 
p is Ω(l). Note that the time complexity per level of the renaming tree is 
constant. Moreover, with the ENTRY and EXIT calls implemented using Yang and 
Anderson's algorithm [14], the ENTRY_2, EXIT_2, ENTRY_3, and EXIT_3 calls take 
constant time, and the ENTRY_N and EXIT_N calls take O(log N) time. Note that the 
ENTRY_N and EXIT_N routines are called by a process only if its point contention 
is Ω(log N). Thus, we have the following. 

Lemma 1: Algorithm U is a correct, starvation-free mutual exclusion algorithm 
with O(min(k, log N)) time complexity and unbounded space complexity. □ 

Of course, the problem with Algorithm U is that the rnd field of Y[i] that 
is used for assigning round numbers grows without bound. We now consider a 
variant of Algorithm U in which space is bounded. 

Algorithm B. In Algorithm B (for bounded), which is shown in Fig. 5, modulo-N 
addition (denoted by ⊕) is used when incrementing Y[i].rnd. With this change, 
the following potential problem arises. A process p may reach statement 8[i] 
in Fig. 5 with y.rnd = r and then be delayed. While delayed, other processes 
may repeatedly increment Y[i].rnd (statement 27[i]) until it “cycles back” to 
r. Another process q could then reach statement 8[i] with y.rnd = r. This is a 
problem because p and q may interfere with each other in updating Rnd[i][r]. 

Algorithm B prevents such a scenario from happening by preventing Y[i].rnd 
from cycling while a process p that stops at splitter i executes within statements 
8[i]-31. Informally, cycling is prevented by requiring process p to erect an 
“obstacle” that prevents Y[i].rnd from being incremented beyond the value p. More 
precisely, note that before reaching statement 8[i], process p must first assign 
Obstacle[p] := i at statement 5[i]. Note further that before a process can 
increment Y[i].rnd from r to r ⊕ 1 (statement 27[i]), it must first read Obstacle[r] 
(statement 25[i]) and find it to have a value different from i. This check prevents 
Y[i].rnd from being incremented beyond the value p while p executes within 
statements 8[i]-31. Note that process p resets Obstacle[p] to 0 at statement 18. 
This is done to ensure that p's own obstacle doesn't prevent it from incrementing 
a splitter's round number. 
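
The obstacle check can be illustrated with a small sequential sketch. It models 
only the guard of statements 24-27 of Fig. 5; the function name try_reopen, the 
flag stopped_here (standing for the test j = lvl), and the list-based layout of 
Rnd and Obstacle are hypothetical conveniences, not part of the algorithm itself. 

```python
def try_reopen(n, rnd, stopped_here, Rnd, Obstacle, N):
    """Sketch of the guard at statements 24-27 of Fig. 5 for splitter n.

    rnd is the round number read from Reset[n]; stopped_here models the
    test j = lvl. Returns the next round number if the splitter may be
    reopened, or None if the reopening must be skipped.
    """
    round_idle = stopped_here or not Rnd[n][rnd]
    no_obstacle = Obstacle[rnd] != n      # statement 25: obstacle erected at n?
    if round_idle and no_obstacle:
        return (rnd + 1) % N              # modulo-N increment (the ⊕ operator)
    return None


N, T = 4, 3
Rnd = [[False] * N for _ in range(T + 1)]   # index 0 unused; splitters are 1..T
Obstacle = [0] * N                          # 0 means "no obstacle"

wrapped = try_reopen(1, N - 1, True, Rnd, Obstacle, N)   # round N-1 wraps to 0
Obstacle[2] = 1                                          # process 2 erects an obstacle at splitter 1
blocked = try_reopen(1, 2, True, Rnd, Obstacle, N)       # round 2 cannot be passed at splitter 1
elsewhere = try_reopen(3, 2, True, Rnd, Obstacle, N)     # other splitters are unaffected
```

In this sketch the obstacle at index 2 blocks only splitter 1, which mirrors the 
fact that Obstacle[r] holds a splitter index rather than a boolean. 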

To this point, we have explained every difference between Algorithms U and 
B except one: in Fig. 5, there are added assignments to elements of Y and X 
(statements 20 and 21) after the critical section. The reason for these assignments 
is as follows. Suppose some process p is about to assign Obstacle[p] := i at 
statement 5[i], but gets delayed. (In other words, p is “about to” erect an obstacle 
at splitter i.) We must ensure that if p ultimately reaches statement 8[i], then 
Y[i].rnd does not get incremented beyond the value p. Let r be the value read 
from Y[i].rnd by p at statement 3[i]. For Y[i].rnd to be incremented beyond p, some 
other process q that reads Y[i].rnd = r must attempt to reopen splitter i. 

So, suppose that process q reopens splitter i by executing statement 27[i] 
while p is delayed at statement 5[i]. If process q executes statement 21[i] after 
p executes statement 2[i], then p will find X[i] ≠ p at statement 6[i] and will 
be deflected left. So, assume that q executes statement 21[i] before p executes 
statement 2[i]. This implies that q establishes Y[i].free = false by executing 
statement 20[i] before p reads Y[i] at statement 3[i]. Note that Y[i].free = true 
is only established within a critical section (statement 27[i]). Also, note that we 
have established the following sequence of statement executions (perhaps 
interleaved with statement executions of other processes): q executes statements 20[i] 
and 21[i]; p executes statements 2[i]-5[i]; q executes statement 27[i] (q's execution 
of statements 22[i]-26[i] may interleave arbitrarily with p's execution of 
statements 2[i]-5[i]). Because statements 17[i]-27[i] are executed as a critical section, 
this implies that p reads Y[i].free = false at statement 3[i], and thus does not 
reach statement 5[i], which is a contradiction. We conclude from this reasoning 
that if p is delayed at statement 5[i], and if p ultimately reaches statement 8[i], 
then Y[i].rnd does not get incremented beyond the value p. 

From the discussion above, we have the following property and lemma. 

Property 2: If distinct processes p and q have executed statement 7[i] and have 
nd = i, then the value of p's private variable y.rnd differs from that of q's. □ 

/* all variable declarations are as defined in Fig. 4 except as noted here */

type
    Ytype = record free: boolean; rnd: 0..N-1 end    /* stored in one word */

shared variables
    X: array[1..T] of 0..N-1;
    Rnd: array[1..T][0..N-1] of boolean init false;
    Obstacle: array[0..N-1] of 0..T init 0;
    Acquired: array[1..T] of boolean init false

process p ::    /* 0 ≤ p < N */
while true do
0:  Noncritical Section;
1:  nd, lvl := 1, 0;
    /* descend renaming tree */
    repeat
2:      X[nd], dir := p, S;
3:      y := Y[nd];
        if ¬y.free then dir := R
        else
4:          Y[nd] := (false, 0);
5:          Obstacle[p] := nd;
6:          if X[nd] ≠ p ∨
7:             Acquired[nd] then
                dir := L
            else
8:              Rnd[nd][y.rnd] := true;
9:              if Reset[nd] ≠ y then
10:                 Rnd[nd][y.rnd], dir := false, L
                fi
            fi
        fi;
11:     path[lvl] := (nd, dir);
        if dir ≠ S then
            lvl, nd := lvl + 1, 2 · nd;
            if dir = R then nd := nd + 1 fi
        fi
    until (lvl > D) ∨ (dir = S);
    if lvl ≤ D then    /* got a name */
12:     Acquired[nd] := true;
        for j := lvl downto 0 do
13:         ENTRY_3(path[j].nd, path[j].dir)
        od;
14:     ENTRY_2(0)
    else    /* didn't get a name */
15:     ENTRY_N(p);
16:     ENTRY_2(1)
    fi;
17: Critical Section;
18: Obstacle[p] := 0;
    /* reset splitters */
    for j := min(lvl, D) downto 0 do
        if path[j].dir ≠ R then
19:         n := path[j].nd;
20:         Y[n] := (false, 0);
21:         X[n] := p;
22:         y := Reset[n];
23:         Reset[n] := (false, y.rnd);
24:         if (j = lvl ∨ ¬Rnd[n][y.rnd]) ∧
25:            Obstacle[y.rnd] ≠ n then
26:             Reset[n] := (true, y.rnd ⊕ 1);
27:             Y[n] := (true, y.rnd ⊕ 1)
            fi;
28:         if j = lvl then
                Rnd[n][y.rnd] := false
            fi
        fi
    od;
    /* execute exit sections */
    if lvl ≤ D then
29:     EXIT_2(0);
        for j := 0 to lvl do
30:         EXIT_3(path[j].nd, path[j].dir)
        od;
31:     Acquired[nd] := false
    else
32:     EXIT_2(1);
33:     EXIT_N(p)
    fi
od

Fig. 5. Algorithm B: adaptive algorithm with O(N^2) space complexity. 

Lemma 2: Algorithm B is a correct, starvation-free mutual exclusion algorithm 
with O(min(k, log N)) time complexity and O(N^2) space complexity. □ 

The O(N^2) space complexity of Algorithm B is due to the Rnd array. We 
now show that this O(N^2) array can be replaced by an O(N) linked list. 

Algorithm L. In Algorithm L (for linear), which is depicted in Fig. 6, a common 
pool of round numbers ranging over 1..U is used for all splitters in the 
renaming tree. As we shall see, O(N) round numbers suffice. In Algorithm B, our 
key requirement for round numbers was that they not be reused “prematurely.” 
With a common pool of round numbers, a process should not choose r as the next 
round number for some splitter if there is a process anywhere in the renaming 
tree that “thinks” that r is the current round number of some splitter. 

Fortunately, since each process selects new round numbers within its critical 
section, it is fairly easy to ensure this requirement. All that is needed are a few 
extra data structures that track which round numbers are currently in use. These 
data structures replace the Obstacle array of Algorithm B. The main new data 
structure is a queue Free of round numbers. In addition, there is a new shared 
array Inuse, and a new shared variable Check. We assume that the Free queue 
can be manipulated by the usual Enqueue and Dequeue operations, and also by 
an operation MoveToTail(Free, i: 1..U), which moves i to the end of Free, if it 
is in Free. If Free is implemented as a doubly-linked list, then these operations 
can be performed in constant time. We stress that Free is accessed only within 
critical sections, so it is really a sequential data structure. 

When comparing Algorithms B and L, the only difference before the critical 
section is statement 5[i]: instead of updating Obstacle[p], process p now marks the 
round number r it just read from Y[i] as being “in use” by assigning Inuse[p] := 
r. The only other differences are in the code after the critical section (statements 
18-33 in Fig. 6). Statements 24-27 are executed to ensure that no round number 
currently “in use” can propagate to the head of the Free queue. In particular, if 
a process p is delayed after having obtained r as the current round number for 
some splitter, then while it is delayed, r will be moved to the end of the Free 
queue by every N-th critical section execution. With U = T + 2N round numbers, 
this is sufficient to prevent r from reaching the head of the queue while p is 
delayed. (T + 2N round numbers are needed because the calls to Dequeue and 
MoveToTail can cause a round number to migrate toward the head of the Free 
queue by two positions per critical section execution.) Statement 28[i] enqueues 
the current round number for splitter i onto the Free queue. (Note that there 
may be other processes within the renaming tree that “think” that the 
just-enqueued round number is the current round number for splitter i; this is why 
we need a mechanism to prevent round numbers from prematurely reaching the 
head of the queue.) Statement 29[i] simply dequeues a new round number from 
Free. The rest of the algorithm is the same as before. 
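
The queue discipline described above can be exercised with a sequential sketch. 
The text assumes a doubly-linked list; the sketch below substitutes Python's 
OrderedDict, which also gives constant-time enqueue, dequeue, and move-to-tail. 
The class FreeQueue, the function simulate, and the randomized choice of which 
splitter is reset are illustrative assumptions, and the guard of statement 23 is 
omitted, so every simulated critical section performs statements 24-29. 

```python
from collections import OrderedDict
import random


class FreeQueue:
    """Queue of round numbers with O(1) enqueue, dequeue and move-to-tail."""

    def __init__(self, items):
        self._q = OrderedDict((x, None) for x in items)

    def enqueue(self, x):
        self._q[x] = None                    # appended at the tail

    def dequeue(self):
        x, _ = self._q.popitem(last=False)   # removed from the head
        return x

    def move_to_tail(self, x):
        if x in self._q:                     # no effect if x is not in the queue
            self._q.move_to_end(x)


def simulate(N, T, resets, seed=0):
    """Process 0 reads round r at splitter 1 and then stalls forever."""
    rng = random.Random(seed)
    U = T + 2 * N
    free = FreeQueue(range(T + 1, U + 1))          # initially (T+1) ... U
    current = {s: s for s in range(1, T + 1)}      # current round of each splitter
    inuse = [0] * N
    check = 0
    r = current[1]
    inuse[0] = r                                   # statement 5, then p is delayed
    handed_out = []
    for _ in range(resets):
        s = rng.randint(1, T)                      # splitter being reset
        for q in range(1, N):                      # arbitrary state of other processes
            inuse[q] = rng.randint(0, U)
        ptr = check                                # statement 24
        usdrd = inuse[ptr]                         # statement 25
        if usdrd != 0:                             # statement 26
            free.move_to_tail(usdrd)
        check = (ptr + 1) % N                      # statement 27
        free.enqueue(current[s])                   # statement 28
        current[s] = free.dequeue()                # statements 29-31
        handed_out.append(current[s])
    return r, handed_out


stalled_round, handed_out = simulate(N=4, T=3, resets=2000)
```

Under these assumptions the stalled round number should never appear in 
handed_out: Check visits the stalled process once every N resets, and between 
visits the round number can advance by at most two positions per reset in a 
queue that always holds 2N entries. 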

/* all variable declarations are as defined in Fig. 5 except as noted here */

const U = T + 2N    /* number of possible round numbers = O(N) */

type Ytype = record free: boolean; rnd: 0..U end    /* stored in one word */

shared variables
    Reset: array[1..T] of Ytype;
    Rnd: array[1..U] of boolean init false;
    Free: queue of integers;
    Inuse: array[0..N-1] of 0..U init 0;
    Check: 0..N-1 init 0

initially
    (∀i : 1 ≤ i ≤ T :: Y[i] = (true, i) ∧ Reset[i] = (true, i)) ∧
    (Free = (T + 1) → ··· → U)

private variables
    ptr: 0..N-1; nxtrd: 1..U; usdrd: 0..U

process p ::    /* 0 ≤ p < N */
while true do
0:  Noncritical Section;
1:  nd, lvl := 1, 0;
    /* descend renaming tree */
    repeat
2:      X[nd], dir := p, S;
3:      y := Y[nd];
        if ¬y.free then dir := R
        else
4:          Y[nd] := (false, 0);
5:          Inuse[p] := y.rnd;
6:          if X[nd] ≠ p ∨
7:             Acquired[nd] then
                dir := L
            else
8:              Rnd[y.rnd] := true;
9:              if Reset[nd] ≠ y then
10:                 Rnd[y.rnd], dir := false, L
                fi
            fi
        fi;
11:     path[lvl] := (nd, dir);
        if dir ≠ S then
            lvl, nd := lvl + 1, 2 · nd;
            if dir = R then nd := nd + 1 fi
        fi
    until (lvl > D) ∨ (dir = S);
    if lvl ≤ D then    /* got a name */
12:     Acquired[nd] := true;
        for j := lvl downto 0 do
13:         ENTRY_3(path[j].nd, path[j].dir)
        od;
14:     ENTRY_2(0)
    else    /* didn't get a name */
15:     ENTRY_N(p);
16:     ENTRY_2(1)
    fi;
17: Critical Section;
    /* reset splitters */
    for j := min(lvl, D) downto 0 do
        if path[j].dir ≠ R then
18:         n := path[j].nd;
19:         Y[n] := (false, 0);
20:         X[n] := p;
21:         y := Reset[n];
22:         Reset[n] := (false, y.rnd);
23:         if j = lvl ∨ ¬Rnd[y.rnd] then
24:             ptr := Check;
25:             usdrd := Inuse[ptr];
26:             if usdrd ≠ 0 then
                    MoveToTail(Free, usdrd)
                fi;
27:             Check := ptr ⊕ 1;
28:             Enqueue(Free, y.rnd);
29:             nxtrd := Dequeue(Free);
30:             Reset[n] := (true, nxtrd);
31:             Y[n] := (true, nxtrd)
            fi;
            if j = lvl then
32:             Rnd[y.rnd] := false;
33:             Inuse[p] := 0
            fi
        fi
    od;
    /* execute exit sections */
    if lvl ≤ D then
34:     EXIT_2(0);
        for j := 0 to lvl do
35:         EXIT_3(path[j].nd, path[j].dir)
        od;
36:     Acquired[nd] := false
    else
37:     EXIT_2(1);
38:     EXIT_N(p)
    fi
od

Fig. 6. Algorithm L: adaptive algorithm with O(N) space complexity. 

The space complexity of Algorithm L is clearly O(N), if we ignore the space 
required to implement the ENTRY and EXIT routines. (Each process has an O(log N) 
path array. These arrays are actually unneeded, as simple calculations can be 
used to determine the parent and children of a splitter.) If the ENTRY/EXIT 
routines are implemented using Yang and Anderson's arbitration-tree algorithm 
[14], then the overall space complexity is actually O(N log N). This is because 
in Yang and Anderson's algorithm, each process needs a distinct spin location 
for each level of the arbitration tree. However, as we will show in the full paper, 
it is quite straightforward to modify the arbitration-tree algorithm so that each 
process uses the same spin location at each level of the tree. This modified 
algorithm has O(N) space complexity. We conclude by stating our main theorem. 

Theorem 1. N-process mutual exclusion can be implemented under read/write 
atomicity with time complexity O(min(k, log N)) and space complexity Θ(N). □ 
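
Returning to the remark that the path arrays are unneeded: because a 
left-deflected process moves from splitter nd to 2·nd and a right-deflected one to 
2·nd + 1, the route to any splitter can be recomputed from its index alone. The 
following sketch shows one way to do this; the function names path_to and 
children are hypothetical and do not appear in the figures. 

```python
def children(nd):
    """Left and right child of splitter nd in the renaming tree."""
    return 2 * nd, 2 * nd + 1


def path_to(nd):
    """Recompute the (splitter, direction) pairs leading from the root to nd.

    The pair for nd itself is not included; a process that stops at nd
    would record (nd, S) there, as in statement 11.
    """
    steps = []
    while nd > 1:
        parent = nd // 2
        # An odd index is a right child, an even index a left child.
        steps.append((parent, "R" if nd % 2 else "L"))
        nd = parent
    steps.reverse()
    return steps


route = path_to(5)   # root 1 deflects left to 2, splitter 2 deflects right to 5
```

This recovers the entries that path[0..lvl-1] would hold, using O(1) extra 
space per step rather than a stored array. 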

Acknowledgement: Gary Peterson recently conjectured to us that adaptivity under 
the remote-memory-references time measure must necessitate space complexity. 

His conjecture led us to develop Algorithm L. 



References 

1. Y. Afek, H. Attiya, A. Fouren, G. Stupp, and D. Touitou. Long-lived renaming 
made adaptive. In Proceedings of the 18th ACM Symposium on Principles of 
Distributed Computing, pages 91-103, 1999. 

2. J. Anderson and Y.-J. Kim. Adaptive mutual exclusion with local spinning (full 
version of this paper). At http://www.cs.unc.edu/~anderson/papers.html. 

3. J. Anderson and Y.-J. Kim. Fast and scalable mutual exclusion. In Proceedings 
of the 13th International Symposium on Distributed Computing, pages 180-194, 
September 1999. Full version to appear in Distributed Computing. 

4. T. Anderson. The performance of spin lock alternatives for shared-memory 
multiprocessors. IEEE Trans. on Parallel and Distributed Sys., 1(1):6-16, 1990. 

5. H. Attiya and V. Bortnikov. Adaptive and efficient mutual exclusion. To appear in 
Proceedings of the 19th ACM Symposium on Principles of Distributed Computing. 

6. M. Choy and A. Singh. Adaptive solutions to the mutual exclusion problem. 
Distributed Computing, 8(1):1-17, 1994. 

7. E. Dijkstra. Solution of a problem in concurrent programming control. 
Communications of the ACM, 8(9):569, 1965. 

8. G. Graunke and S. Thakkar. Synchronization algorithms for shared-memory 
multiprocessors. IEEE Computer, 23:60-69, 1990. 

9. L. Lamport. A fast mutual exclusion algorithm. ACM Trans. on Computer Sys., 
5(1):1-11, 1987. 

10. J. Mellor-Crummey and M. Scott. Algorithms for scalable synchronization on 
shared-memory multiprocessors. ACM Trans. on Computer Sys., 9(1):21-65, 1991. 

11. M. Merritt and G. Taubenfeld. Speeding Lamport's fast mutual exclusion 
algorithm. Information Processing Letters, 45:137-142, 1993. 

12. M. Moir and J. Anderson. Wait-free algorithms for fast, long-lived renaming. 
Science of Computer Programming, 25(1):1-39, 1995. 

13. E. Styer. Improving fast mutual exclusion. In Proceedings of the 11th ACM 
Symposium on Principles of Distributed Computing, pages 159-168, 1992. 

14. J.-H. Yang and J. Anderson. A fast, scalable mutual exclusion algorithm. 
Distributed Computing, 9(1):51-60, 1995. 




Bounds for Mutual Exclusion with only 
Processor Consistency 



Lisa Higham and Jalal Kawash 

Department of Computer Science, The University of Calgary, Canada 
{higham | kawash}@cpsc.ucalgary.ca 



Abstract. Most weak memory consistency models are incapable of supporting 
a solution to mutual exclusion using only read and write operations to shared 
variables. Processor Consistency-Goodman's version (PC-G) is an exception. 
Ahamad et al. [1] showed that Peterson's mutual exclusion algorithm is correct 
for PC-G, but Lamport's bakery algorithm is not. In this paper, we derive a lower 
bound on the number and type (single- or multi-writer) of variables that a 
mutual exclusion algorithm must use in order to be correct for PC-G. We show that 
any such solution for n processes must use at least one multi-writer and n 
single-writers. This lower bound is tight when n = 2, and is tight when n > 2 for 
solutions that do not provide fairness. We show that Burns' algorithm is an unfair 
solution for mutual exclusion in PC-G that achieves our bound. However, five 
other known algorithms that use the same number and type of variables do not 
guarantee mutual exclusion when the memory consistency model is only PC-G, 
as opposed to the Sequential Consistency model for which they were designed. 
A corollary of this investigation is that, in contrast to Sequential Consistency, 
multi-writers cannot be implemented from single-writers in PC-G. 



1 Introduction 

The Mutual Exclusion Problem is the most famous and well-studied problem in concur- 
rency. Following Silberschatz et al.[14], we refer to this problem as the Critical Section 
Problem (CSP) to distinguish the problem from the Mutual Exclusion Property. In CSP, 
a set of processes coordinate to share a resource, while ensuring that no two access the 
resource concurrently. CSP solutions for memories that satisfy Sequential Consistency 
(SC) have been known since the 1960s; Raynal [13] provides an extensive survey. In 
fact, as shown by Lamport [10], even single-reader single-writer bits suffice to solve the 
critical section problem, as long as accesses to these objects are guaranteed to be SC. 

Most weak memory consistency models, however, are incapable of supporting a 
solution to CSP using only read and write operations on shared variables [7]. Mutual 
exclusion on weak memory consistency models such as Java, Coherence, 
Pipelined-RAM, Total and Partial Store Ordering, Causal Memory, and several variants of 
Processor Consistency requires the use of expensive built-in synchronization primitives 
such as locks, compare-and-swap, fetch-and-add and others [7]. A notable exception is 
Processor Consistency (abbreviated PC-G)^1 as proposed by Goodman and formalized 
by Ahamad et al. [1]. Though weaker than SC, this variant of processor consistency 
guarantees that processes have just enough agreement about the current state of shared 
memory to support a solution using only reads and writes of shared variables. 

Ahamad et al. have shown that Peterson's mutual exclusion algorithm [12] is 
correct for PC-G, but that Lamport's bakery algorithm [8] fails for PC-G [1]. We are thus 
motivated to determine what is necessary and sufficient to solve CSP with only PC-G 
memory using only reads and writes to shared variables. For example, Peterson's 
algorithm makes use of multi-writers, variables that can be written by more than one 
process, while Lamport's bakery algorithm [8] uses only single-writers, variables that 
can be written by exactly one process. Are multi-writers essential? 

In this paper, we derive tight bounds on the number and type (single- or 
multi-writer) of variables that a mutual exclusion algorithm must use in order to be correct for 
PC-G. Specifically, any PC-G solution for n processes must use at least one multi-writer 
and n single-writers. We prove that Burns' algorithm [3], which uses one multi-writer 
and n single-writers, is an unfair solution for mutual exclusion in PC-G. Thus our bound 
is tight for unfair solutions to CSP. Since Peterson's 2-processor algorithm is fair and 
correct for PC-G, our bound is tight even for fair solutions when n = 2. 

We further investigate properties that a solution, using one multi-writer and n 
single-writers, must satisfy in order to be correct for PC-G. Using these properties, we 
establish that five algorithms [13], Dekker's, Dijkstra's, Knuth's, De Bruijn's, and Eisenberg and 
MacGuire's, do not guarantee mutual exclusion under only PC-G memory consistency. 
All of these have been developed for SC [9], and all use one multi-writer and n 
single-writers. However, most of these algorithms are fair solutions for CSP in SC. The only 
fair solution we have found for PC-G is Peterson's, which uses n-1 multi-writers and 
n single-writers. 

Since multi-writers are required to solve CSP in PC-G, a corollary of our investiga- 
tion is that, in contrast to SC, multi-writers cannot be implemented from single-writers 
in PC-G. 

Section 2 includes the definitions needed for this paper. Section 3 provides a tem- 
plate for our impossibility proofs, which is used to establish our lower bounds in Section 
4. The major results in Section 4 have been automatically verified using the SPIN model 
checker [6]. 

2 Definitions 

2.1 Multiprocess Systems and Memory Consistency Models 

A multiprocess system can be modeled as a collection of processes operating on a 
collection of shared data objects. For this paper, the shared data objects are variables 
supporting only read and write operations, where r(x)v and w(x)v denote, respectively, 
a read operation of variable x returning v and a write operation to x of value v. An 
operation can be decomposed into invocation (performed by processes) and response 
(returned by variables) components. 

^1 Several variants of Processor Consistency exist. The one referred to in this paper is due to 
Ahamad et al.'s [1] interpretation of Goodman's original work [5]. 

It suffices to model a process as a sequence of read and write invocations, and a 
multiprocess system as a collection of processes together with the shared variables. 
Henceforth, we denote a multiprocess system by the pair $(P, J)$ where $P$ is a set of 
processes and $J$ is a set of variables. A process computation is the sequence of reads 
and writes obtained by augmenting each read invocation in the process with its matching 
response. A (multiprocess) system computation is a collection of process computations, 
one for each process in the collection. 

Let $O$ be all the (read and write) operations in a computation of a system $(P, J)$. 
Then, $O|p$ denotes all the operations that are in the process computation of process 
$p \in P$; $O|x$ denotes all the operations that are applied to variable $x \in J$, and $O|w$ denotes 
all the write operations. These notations are also combined to select those operations 
satisfying several restrictions at once. For example, $O|w|x|p$ is the set of all write 
operations by process p to variable x. 

A sequence of read and write operations to variable x is valid if and only if each 
read in the sequence returns the value of the most recently preceding write. Given any 
collection of read and write operations $O$ on a set of variables $J$, a linearization of $O$ 
is a (strict) linear order^2 $(O, \xrightarrow{L})$ such that for each variable x in $J$, the subsequence 
$(O|x, \xrightarrow{L})$ of $(O, \xrightarrow{L})$ is valid. A linear order $(O, \xrightarrow{L})$ is also represented in this paper 
as the sequence $L = o_1, o_2, \ldots$ where $o_i$ precedes $o_j$ in $L$ if and only if $(o_i, o_j) \in \xrightarrow{L}$. 
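
The restriction notation and the validity condition translate directly into a 
few lines of code. The sketch below is illustrative only: the tuple encoding 
(process, kind, variable, value), the function names restrict and is_valid, and 
the assumption that every variable initially holds 0 are choices made here, not 
part of the definitions above. 

```python
def restrict(ops, proc=None, kind=None, var=None):
    """Select operations as in O|p, O|w, O|x; filters may be combined."""
    return [
        op for op in ops
        if (proc is None or op[0] == proc)
        and (kind is None or op[1] == kind)
        and (var is None or op[2] == var)
    ]


def is_valid(seq, initial=0):
    """True if each read returns the most recently preceding write.

    A read with no preceding write to its variable is compared against
    `initial`, which is an assumption of this sketch.
    """
    last = {}
    for _proc, kind, var, val in seq:
        if kind == "w":
            last[var] = val
        elif last.get(var, initial) != val:
            return False
    return True


O = [
    ("p", "w", "x", 1),
    ("q", "r", "x", 1),
    ("q", "w", "x", 2),
    ("p", "r", "x", 1),   # stale: the most recent write to x stored 2
]
writes_by_q_to_x = restrict(O, proc="q", kind="w", var="x")   # O|w|x|q
```

Here the full sequence O is not valid because of its last read, while the 
prefix consisting of the first three operations is. 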

Let $O$ be a set of operations in a computation of a system $(P, J)$. Define the program 
order, denoted $(O, \xrightarrow{prog})$, by $o_1 \xrightarrow{prog} o_2$ if and only if $o_2$ follows $o_1$ in the computation of 
some process p. 

A (memory) consistency model is a set of constraints on system computations. A 
system $(P, J)$ satisfies memory consistency D if every computation that can arise from 
it meets all the constraints in D. 

We allow a (sequential) program to be any computer code containing control struc- 
tures, local variables together with any computable operations on them, and reads and 
writes of global variables. (The global variables are so restricted because we are inter- 
ested in what can be achieved with just reads and writes of variables for communication 
between processes.) Then a (multiprocess) algorithm is just a collection of such pro- 
grams where all global variables are shared. (Often, but not necessarily, each program 
in the collection is the same.) The algorithm together with the memory consistency 
model can produce some set of system computations, where each program gives rise to 
a process as defined above. 

Four memory consistency models are referred to in this paper. SC [9] is a strong 
memory consistency model that arises (for example) when the memory shared between 
processes is single-ported and thus all reads and writes to the memory are serialized. 
Thus, SC guarantees that the computation of the system is the result of some 
interleaving of the processes. This model is typically assumed by algorithm designers, and a 
challenge for system designers is to build SC systems while exploiting the efficiencies 
of distributed shared memory. 

^2 A (strict) partial order (simply, partial order) is an anti-reflexive, transitive relation. Denote a 
partial order by a pair $(S, R)$. The notation $s_1 R s_2$ means $(s_1, s_2) \in R$. A (strict) linear order is 
a partial order $(S, R)$ such that $\forall x, y \in S$, $x \neq y$, either $xRy$ or $yRx$. 

Definition 1. Let $O$ be all the operations of a computation C of a multiprocess system 
$(P, J)$. Then C satisfies SC if there is a linearization $(O, \xrightarrow{L})$ such that 
$(O, \xrightarrow{prog}) \subseteq (O, \xrightarrow{L})$. 

If the single-ported globally shared memory is partitioned into several components 
each of which has separate single-ported access, then SC of the whole system is lost, 
but is maintained for each component individually. In the extreme, when each shared 
variable has its own access channel, the memory consistency model is called Coherence 
[4]. In a Coherent memory model, reads and writes of different variables can happen in 
time in the opposite order from program order. However, such a system still ensures that 
for each shared variable the outcome of the computation results from some interleaving 
of the process reads and writes to that variable. 

Definition 2. Let $O$ be all the operations of a computation C of a multiprocess system 
$(P, J)$. Then C satisfies Coherence if for each variable $x \in J$ there is a linearization 
$(O|x, \xrightarrow{L_x})$ such that $(O|x, \xrightarrow{prog}) \subseteq (O|x, \xrightarrow{L_x})$. 
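
For very small computations, Definitions 1 and 2 can be checked by brute force: 
enumerate every interleaving that respects program order and test whether any 
of them is valid. The sketch below does this; the encoding of operations as 
(kind, variable, value) tuples, the function names, and the assumption that 
variables initially hold 0 are all illustrative choices rather than part of the 
definitions. 

```python
def interleavings(seqs):
    """Yield every merge of the given sequences that keeps each one in order."""
    seqs = [s for s in seqs if s]
    if not seqs:
        yield []
        return
    for i, s in enumerate(seqs):
        rest = seqs[:i] + [s[1:]] + seqs[i + 1:]
        for tail in interleavings(rest):
            yield [s[0]] + tail


def is_valid(seq, initial=0):
    """Each read must return the most recent preceding write (or `initial`)."""
    last = {}
    for kind, var, val in seq:
        if kind == "w":
            last[var] = val
        elif last.get(var, initial) != val:
            return False
    return True


def satisfies_sc(comp):
    """Definition 1: one valid linearization of all operations."""
    return any(is_valid(lin) for lin in interleavings(list(comp.values())))


def satisfies_coherence(comp):
    """Definition 2: a valid linearization for each variable separately."""
    variables = {op[1] for ops in comp.values() for op in ops}
    return all(
        any(
            is_valid(lin)
            for lin in interleavings(
                [[op for op in ops if op[1] == x] for ops in comp.values()]
            )
        )
        for x in variables
    )


# Each process writes its own flag and then reads the other's as still 0.
comp = {
    "p": [("w", "x", 1), ("r", "y", 0)],
    "q": [("w", "y", 1), ("r", "x", 0)],
}
sc_ok = satisfies_sc(comp)
coherent_ok = satisfies_coherence(comp)
```

Under these assumptions the example computation is coherent but not SC, which 
illustrates how reads and writes of different variables may appear reordered in 
a Coherent memory. The enumeration is exponential and is meant only for 
hand-sized examples. 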

Now consider a message-passing network of processes each of which stores a local 
copy of the shared memory. If the message channels are FIFO and form a complete 
network, reads are implemented by consulting the local memory, and writes are broadcast 
to every other process, then the memory consistency model that arises is the 
Pipelined-Random Access Machine (P-RAM) [11]. 

Definition 3. Let $O$ be all the operations of a computation C of a multiprocess system 
$(P, J)$. Then C satisfies P-RAM if for each process $p \in P$ there is a linearization 
$(O|p \cup O|w, \xrightarrow{L_p})$ such that $(O|p \cup O|w, \xrightarrow{prog}) \subseteq (O|p \cup O|w, \xrightarrow{L_p})$. 

For a memory model to meet PC-G [1], there must be a set of linearizations that 
simultaneously satisfy both Coherence and P-RAM. 

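Definition 4. Let $O$ be all the operations of a computation C of a multiprocess system 
$(P, J)$. Then C satisfies PC-G if for each process $p \in P$ there is a linearization 
$(O|p \cup O|w, \xrightarrow{L_p})$ such that 

1. $(O|p \cup O|w, \xrightarrow{prog}) \subseteq (O|p \cup O|w, \xrightarrow{L_p})$, and 

2. $\forall q \in P$ and $\forall x \in J$, $(O|w|x, \xrightarrow{L_p}) = (O|w|x, \xrightarrow{L_q})$. 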

2.2 Critical Section Problem 

We denote a CSP problem by CSP(n) where n is the number of processes in the system. 
Each process has the following structure: 

    repeat
        <remainder>
        <entry>
        <critical section>
        <exit>
    until false

A solution to CSP(n), n ≥ 2, must satisfy the following two properties^3: 

- Mutual Exclusion: At any time there is at most one process in its <critical section>. 

- Progress: If at least one process is in <entry>, then eventually one will be in 
<critical section>. 

CSP typically requires some notion of fairness as well. One possible fairness property 
is: 

- Fairness: If a process p is in <entry>, then p will eventually be in <critical 
section>. 

It is possible to consider stronger notions of fairness. We will see, however, that our im- 
possibility and lower bound results apply even to unfair solutions of CSP, and therefore 
we make no fairness requirement in our definition. 

Notice that time is used in the definition of CSP. However, we make no assumptions 
about agreement in rate or value between the clocks that are part of the multiprocess 
system, and, therefore, the memory consistency models considered here have been de- 
fined without reference to time. So we need to clarify how a system without a consistent 
notion of time can be tested for a property involving time. The multiprocess system ex- 
ists in some environment that has its own meaningful time which we call real time. In 
the case of CSP, which is controlling access to some resource, real time can be taken to 
be the local clock time of that resource. For a system to satisfy the Mutual Exclusion 
property it is required that there is no computation of that system for which there are 
two or more processes in their critical sections at the same real time. 

Let A and D be an algorithm and a memory consistency model, respectively. Then, 
A solves CSP for D if for every system S that satisfies D, every computation of S satisfies 
Mutual Exclusion and Progress. 
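
The real-time reading of Mutual Exclusion can be illustrated with a small check. This is only a sketch: the trace format (process name, entry time, exit time, all measured on the resource's clock) and the function name are assumptions made for illustration, not part of the paper.

```python
from itertools import combinations

def violates_mutual_exclusion(intervals):
    """intervals: list of (process, enter_time, exit_time) in real time.

    Returns True if two different processes are inside their critical
    sections at the same real time (their open intervals overlap).
    """
    for (p, s1, e1), (q, s2, e2) in combinations(intervals, 2):
        if p != q and s1 < e2 and s2 < e1:
            return True
    return False

# Hypothetical traces: the first never overlaps, the second does.
ok_trace = [("p", 0, 2), ("q", 2, 5), ("p", 6, 7)]
bad_trace = [("p", 0, 3), ("q", 2, 5)]
```

A computation satisfies the property exactly when no such overlap exists in its real-time trace.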

3 Template for Impossibility and Lower Bound Proofs 

We will use the partial computations 1, 2, and 3 defined below. First, assume for the sake
of contradiction that there exists an algorithm A that solves CSP(n) for a given memory
consistency model, D, for n ≥ 2. This solution must work when exactly two processes,
say p and q, are participating and the rest remain in <remainder>. If A runs with p
in <entry> while q stays in <remainder>, then by the Progress property, p must enter
its <critical section>, producing a partial computation of the form of Computation 1,
where λ denotes the empty sequence, $o_i^p$ denotes the i-th operation of p, and k is a
finite natural number.



¹ Other ways of defining the solution properties are possible, as given by Attiya et al. [2].

Computation 1
    p: $o_1^p, o_2^p, \ldots, o_k^p$   (p is in its <critical section>)
    q: λ

Similarly, if A runs with only q's participation, Progress guarantees that Computation 2 
exists. 



Computation 2
    p: λ
    q: $o_1^q, o_2^q, \ldots, o_l^q$   (q is in its <critical section>)



Now, consider Computation 3, where both p and q are participating and both are in their
<critical section>. By assumption, both computations 1 and 2 satisfy D. If we can
show that Computation 3 also satisfies memory consistency condition D, the desired
contradiction is achieved, since Mutual Exclusion is violated by Computation 3, but it
is a possible outcome of algorithm A. This would imply that there is no algorithm that
solves CSP(n) for memory consistency model D.



Computation 3
    p: $o_1^p, o_2^p, \ldots, o_k^p$   (p is in its <critical section>)
    q: $o_1^q, o_2^q, \ldots, o_l^q$   (q is in its <critical section>)

None of the arguments in the following theorems depends on the Fairness property, so
the impossibilities apply to unfair solutions as well. Furthermore, none of these arguments
depends on the size of variables, so these results apply to unbounded variables
as well.



4 Bounds on CSP for PC-G 

Ahamad et al. [1] proved that Peterson’s algorithm [12], which was originally developed
for SC systems, solves CSP(2) for PC-G. Given an algorithm $A_2$ that solves CSP(2) for PC-G,
an algorithm $A_n$ that solves CSP(n) for PC-G, where n > 2, can be constructed from
$A_2$ by building a tournament tree. Processes are partitioned into sets of size two each.
For each set, $A_2$ is used to select a “winner”. The winners are again partitioned into
sets of size two, and $A_2$ is used in this manner repeatedly until only one winner
remains. Thus we conclude that there is an algorithm that solves CSP(n) for PC-G.
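
The tournament-tree construction can be sketched by computing, for each process, which two-process instances of $A_2$ it must win on its way to the root. The heap-style node numbering and the function name `tournament_path` are assumptions made for this illustration; the paper does not fix a layout.

```python
def tournament_path(i, n):
    """Return the list of (node, side) pairs process i competes at,
    from the bottom of the tree up to the root.

    Internal nodes are numbered heap-style (root = 1, children of v are
    2v and 2v + 1). Each node hosts one two-process lock; 'side' (0 or 1)
    is the identity process i uses in that lock.
    """
    size = 1
    while size < n:          # number of leaves: smallest power of two >= n
        size *= 2
    node = size + i          # leaf assigned to process i
    path = []
    while node > 1:
        side = node % 2      # left child plays side 0, right child side 1
        node //= 2           # move up to the parent, which hosts the lock
        path.append((node, side))
    return path

# Example with four processes: process 3 first competes at node 3, then at the root.
example = tournament_path(3, 4)
```

Two processes that meet at a node always arrive from different children, so each node only ever arbitrates between two competitors, which is what allows a CSP(2) solution to be reused at every level.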

This section further investigates bounds and restrictions on these PC-G solutions. 



4.1 Type of Variables 

A multi-writer variable (simply, multi-writer) can be updated by any number of pro- 
cesses in the system, while a single-writer variable (simply, single-writer) can be up- 
dated by exactly one designated process. 

We show that the use of multi-writers is crucial to solve CSP on PC-G. First we 
need the following lemma. 

Lemma 1. In a system (P, J) where J consists entirely of single-writers, PC-G is equivalent to P-RAM.

Proof: Obviously, PC-G is at least as strong as P-RAM. We show that without the use
of multi-writer variables, P-RAM is at least as strong as PC-G. Let $(O|p \cup O|w, \xrightarrow{L_p})$
and $(O|q \cup O|w, \xrightarrow{L_q})$ be linearizations for p and q ∈ P that are guaranteed by P-RAM.
Since, for any variable x ∈ J, there is only one process, say s, that writes to x, and both
$(O|p \cup O|w, \xrightarrow{L_p})$ and $(O|q \cup O|w, \xrightarrow{L_q})$ have all these writes to x in the program order of s,
the order of the writes to x in $(O|p \cup O|w, \xrightarrow{L_p})$ is the same as the order of the writes
to x in $(O|q \cup O|w, \xrightarrow{L_q})$. Therefore, the definition of PC-G (Definition 4) is satisfied. ■

CSP, however, is impossible for P-RAM: 

Theorem 1. There does not exist an algorithm that solves CSP(n) for P-RAM, even if 
n = 2. 

Proof: Assume that there is an algorithm A that solves CSP(n) for P-RAM. Then
computations 1 and 2 exist. Define the following sequences for p and q, respectively,
for Computation 3.

$(O|p \cup O|w, \xrightarrow{L_p}) = o_1^p, \ldots, o_k^p, (o_1^q, \ldots, o_l^q)|w$

$(O|q \cup O|w, \xrightarrow{L_q}) = o_1^q, \ldots, o_l^q, (o_1^p, \ldots, o_k^p)|w$

Clearly, each preserves program order, as required by the definition of P-RAM. Also, each is a
linearization because the first part (for instance, $o_1^p, \ldots, o_k^p$) corresponds to a possible
computation, and the second part (for instance, $(o_1^q, \ldots, o_l^q)|w$) contains only writes.
Thus, Computation 3 is P-RAM. Therefore, our assumption must have been in error
and A does not exist. ■
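
The construction in this proof can be replayed on a toy example. The sketch below is an illustration only: the two solo runs, the variable names `flag_p` and `flag_q`, and the tuple encoding of operations are hypothetical and do not come from any algorithm in the paper. Each process's view is its own solo computation followed by the other process's writes; both views are valid, although no single interleaving of the two runs is.

```python
def is_valid(view, initial=0):
    """A view is valid if every read returns the value of the latest
    preceding write to the same variable (or the initial value)."""
    memory = {}
    for _proc, kind, var, value in view:
        if kind == "w":
            memory[var] = value
        elif memory.get(var, initial) != value:
            return False
    return True

def writes(ops):
    """Project a sequence of operations onto its writes."""
    return [op for op in ops if op[1] == "w"]

def interleavings(a, b):
    """All merges of a and b that keep each sequence's internal order."""
    if not a or not b:
        yield a + b
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

# Hypothetical solo runs (stand-ins for Computations 1 and 2): each process
# raises its flag, sees the other flag down, and enters its critical section.
comp1_p = [("p", "w", "flag_p", 1), ("p", "r", "flag_q", 0)]
comp2_q = [("q", "w", "flag_q", 1), ("q", "r", "flag_p", 0)]

# Per-process views as in the proof: own computation, then the other's writes.
view_p = comp1_p + writes(comp2_q)
view_q = comp2_q + writes(comp1_p)

pram_ok = is_valid(view_p) and is_valid(view_q)
sc_ok = any(is_valid(s) for s in interleavings(comp1_p, comp2_q))
```

Here `pram_ok` holds while `sc_ok` does not, which mirrors why the argument goes through for P-RAM but not for sequential consistency.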

Theorem 2. There does not exist an algorithm that uses only single-writers and solves
CSP(n) for PC-G, even if n = 2.

Proof: This follows immediately from Lemma 1 and Theorem 1. ■

Ahamad et al. [1] also prove that Lamport’s Bakery algorithm [8], which uses only
single-writers, is incorrect for PC-G. The consequence of Theorem 2 is that any CSP
solution for PC-G must use at least one multi-writer.

Vitanyi and Awerbuch [15] showed that multi-writer variables can be constructed in
a wait-free manner from single-writer variables. In PC-G, by contrast, there is no construction
(not even a non-wait-free one) of multi-writer variables from single-writer variables.

Corollary 1. Multi-writers cannot be implemented from single-writers in PC-G mem- 
ory systems. 



Proof: Peterson’s algorithm solves CSP for PC-G using multi-writers, and there
is no solution with only single-writers by Theorem 2. Hence, multi-writers cannot be
constructed from single-writers in PC-G. ■

4.2 Number of Variables

After showing that at least one multi-writer is required by a CSP solution for PC-G, a 
natural question is what is the minimum number of variables needed to solve CSP(n) 
for PC-G? 

Theorem 3. There does not exist an algorithm that uses fewer than n single-writers
and one multi-writer and solves CSP(n) for PC-G, for any n ≥ 2.

Proof: Assume that there is an algorithm A that uses fewer than n single-writers and 
one multi-writer and solves CSP(n) for PC-G. Since there are n processes, the pigeon- 
hole principle ensures that there is at least one process, say p, that does not write to any 
single-writer variable. Computations 1 and 2 must exist. We show that Computation 3 
satisfies PC-G. 

Let $o_j^q$ be q’s first write to the multi-writer. The following are the required PC-G
linearizations for p and q.

$(O|p \cup O|w, \xrightarrow{L_p}) = o_1^p, \ldots, o_k^p, (o_1^q, \ldots, o_l^q)|w$

$(O|q \cup O|w, \xrightarrow{L_q}) = o_1^q, \ldots, o_{j-1}^q, (o_1^p, \ldots, o_k^p)|w, o_j^q, \ldots, o_l^q$

Both sequences maintain program order. Moreover, p’s sequence is valid because it
consists of Computation 1 followed by only writes by q. Also, q’s sequence is valid
because the segment $o_1^q, \ldots, o_{j-1}^q$ does not contain any writes to the multi-writer. Since p
does not write to any single-writer, the segment $(o_1^p, \ldots, o_k^p)|w$ contains only writes to the
multi-writer. The segment $o_j^q, \ldots, o_l^q$ starts with a write to the multi-writer, overwriting
any changes the segment $(o_1^p, \ldots, o_k^p)|w$ caused. Therefore both are linearizations.

Also, each linearization lists p's writes to the multi-writer followed by q's. Since
only q writes to any single-writer, the two linearizations also agree on the order of writes
to each single-writer. So, both linearizations agree on the order of writes for each variable
(Condition 2 of Definition 4). ■

When n = 2, the bound of Theorem 3 is tight, even if all variables are allowed to be
multi-writers.

Theorem 4. Two variables are insufficient to solve CSP(2) for PC-G.

Proof: Assume that there is an algorithm A that uses exactly 2 variables, say x and
y (even multi-writers), and solves CSP(2) for PC-G. Then, computations 1 and 2 exist.
We show that Computation 3 satisfies PC-G.

Partition p's computation of Computation 3 into subsequences $S_0^p, S_1^p, \ldots$ where
each subsequence $S_i^p$ is defined by:

1. $S_0^p$ contains all operations from $o_1^p$ up to but not including the first write by p,
labeled $o_{a_1}^p$.

2. $S_i^p$, i ≥ 1, contains all operations from $o_{a_i}^p$ up to but not including the first write,
labeled $o_{a_{i+1}}^p$, such that $o_{a_i}^p$ and $o_{a_{i+1}}^p$ are applied to different variables.

Partition q’s computation of Computation 3 into subsequences $S_0^q, S_1^q, \ldots$ similarly.

The subsequence $S_0^p$ is either empty or consists entirely of reads returning initial
values. Each subsequence $S_i^p$ (i ≥ 1) starts with a write and all the writes in $S_i^p$ are
applied to the same variable. If the writes in $S_i^p$ are applied to x, $S_i^p$ is called x-gender;
otherwise, it is called y-gender. Note that the subsequences $S_i^p$ (and likewise $S_i^q$) alternate in gender.
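
The partition into gendered subsequences can be made concrete with a short sketch. The encoding of an operation as a `(kind, variable)` pair and the function name are assumptions for illustration; only the splitting rule follows the definition above.

```python
def partition_by_gender(ops):
    """Split a process's operations into S_0, S_1, ...

    ops is a list of (kind, var) pairs with kind 'r' or 'w'.
    S_0 holds everything before the first write; a new subsequence
    starts at each write whose variable differs from the variable
    written by the current subsequence.
    """
    segments = [[]]           # S_0 (possibly empty)
    current_var = None        # variable written by the current subsequence
    for kind, var in ops:
        if kind == "w" and var != current_var:
            segments.append([])
            current_var = var
        segments[-1].append((kind, var))
    return segments

# Hypothetical operation sequence over the two variables x and y.
ops = [("r", "x"), ("w", "x"), ("r", "y"), ("w", "x"),
       ("w", "y"), ("r", "x"), ("w", "y"), ("w", "x")]
segments = partition_by_gender(ops)
genders = [seg[0][1] for seg in segments[1:]]   # variable of each S_i, i >= 1
```

With only two variables, consecutive subsequences necessarily write different variables, so the genders alternate.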

To show that Computation 3 satisfies PC-G, we consider two cases (the other two 
cases are symmetric). 

$S_1^p$ is x-gender but $S_1^q$ is y-gender: Define $(O|p \cup O|w, \xrightarrow{L_p})$ and $(O|q \cup O|w, \xrightarrow{L_q})$
as follows.

$(O|p \cup O|w, \xrightarrow{L_p}) = S_0^p, (S_0^q)|w, S_1^p, (S_1^q)|w, \ldots, S_i^p, (S_i^q)|w, \ldots$

$(O|q \cup O|w, \xrightarrow{L_q}) = S_0^q, (S_0^p)|w, S_1^q, (S_1^p)|w, \ldots, S_i^q, (S_i^p)|w, \ldots$

Clearly, $(O|p \cup O|w, \xrightarrow{L_p})$ and $(O|q \cup O|w, \xrightarrow{L_q})$ maintain program order. They are also
valid because, for each i ≥ 1, $S_i^q$ (respectively, $S_i^p$) is of the same gender as $S_{i+1}^p$
(respectively, $S_{i+1}^q$). Since $S_i^q$ and $S_{i+1}^p$ are of the same gender, adding $(S_i^q)|w$ immediately
before $S_{i+1}^p$ does not affect p's computation because $S_{i+1}^p$ starts with a write that obliterates
the changes caused by $(S_i^q)|w$; similarly for $S_i^p$ and $S_{i+1}^q$.

The order on the writes to x in p's linearization is:

$(S_1^p)|w, (S_2^q)|w, \ldots, (S_i^p)|w, (S_{i+1}^q)|w, \ldots$ (where i is odd)

which is the same order maintained by q's linearization. The same applies to y. Therefore,
Condition 2 of Definition 4 is also satisfied.

$S_1^p$ and $S_1^q$ are both x-gender: Define $(O|p \cup O|w, \xrightarrow{L_p})$ and $(O|q \cup O|w, \xrightarrow{L_q})$ as
follows.



{0\pU0\w,^} = ((5«)K SP, {St)\w, 5f, ■■■, (5f)K sf, ■■■) 

{0\qU0\w,^} = {Si, S\, {Sf)\w, Si {Sf)\w, 5f, •••, {Sf)\w, Sf^^, ■■■) 

An analysis similar to that of the previous case shows that these are PC-G linearizations.

Thus, in all cases, Computation 3 satisfies PC-G, and our assumption must have been in
error. ■

Since at least one multi-writer is necessary to solve CSP for PC-G, and since two
multi-writers are insufficient to solve CSP(2) for PC-G, and since Peterson’s Algorithm
for CSP(2) uses exactly two single-writers and one multi-writer, we conclude the following.

Corollary 2. Two single-writers and one multi-writer are the necessary and sufficient
number and type of variables required to solve CSP(2) for PC-G.

Algorithm                  Year   |P|     Variables   flag Values   Fairness Delay
Dekker’s                   1965   n = 2   n + 1       2             ∞
Dijkstra’s                 1965   n ≥ 2   n + 1       3             ∞
Knuth’s                    1966   n ≥ 2   n + 1       3             2^(n−1) − 1
De Bruijn’s                1967   n ≥ 2   n + 1       3             (n² − n)/2
Eisenberg and MacGuire’s   1972   n ≥ 2   n + 1       3             n − 1
Burns’                     1981   n ≥ 2   n + 1       2             ∞
Peterson’s                 1981   n ≥ 2   2n − 1      2             (n² − n)/2

Fig. 1. Well known solutions to CSP for Sequential Consistency



4.3 The General Case 

By Theorems 2 and 3, an algorithm that solves CSP(n) for PC-G must use at least n
single-writers and one multi-writer. Most algorithms that solve CSP(n) for SC use exactly
this number and type of variables. In particular, all the algorithms discussed in
this section (except Peterson’s, which uses n single-writers and n−1 multi-writers) use
the same number of variables: one multi-writer (turn) and n single-writers. Furthermore,
each process writes and reads turn, and each process i is associated with the
single-writer flag[i]. Every process j ≠ i reads flag[i]. These algorithms are quoted
in Appendix A and listed in Figure 1, which characterizes each algorithm by four attributes:
number of processes |P| = n, number of variables, number of values that a
flag variable can be assigned, and fairness delay. The fairness delay is the maximum
total number of times other processes can enter their critical sections before a certain
process gets the opportunity to enter its critical section. When there is no upper
bound on the fairness delay (∞), the algorithm is prone to starvation, and is thus unfair.

Although this number of variables is a necessary requirement for a PC-G solution,
we show next that most of these algorithms do not solve CSP(n) for PC-G. First, we
provide some rules of thumb that allow us to pin down certain properties of correct solutions
for PC-G. Then, these rules are used to show that Dekker’s, Dijkstra’s, Knuth’s,
De Bruijn’s, and Eisenberg and MacGuire’s algorithms fail to solve CSP(n) for PC-G.

Lemma 2. Any algorithm that uses exactly n single-writers and one multi-writer and
solves CSP(n) for PC-G must satisfy each of the following properties:

1. Each process writes one single-writer at least once in <entry>. 

2. Each process must write the multi-writer at least once in <entry>, and this write 
cannot be the last operation in <entry>. 

3. Each process must read every other single-writer in <entry>. 

Proof: We follow the proof template given in Section 3.

1. Assume it is not the case; then there is at least one process, say p, that does not
write to any single-writer. The linearizations used in Theorem 3 apply.

2. Assume that a process p either does not write the multi-writer in <entry> or
writes the multi-writer exactly once, and this write is the last operation of <entry>, $o_k^p$.
Under this assumption, Computation 3 satisfies PC-G as shown by the following linearizations.

(6>|pU6>|w, A) = 

{0\qU0\w,^} = 

Both maintain program order and are valid. They also maintain the same order on
the writes to the multi-writer, which is simply q's writes followed by $o_k^p$. Note that this case
is equivalent to the case where the multi-writer is written in the <critical section>
rather than in <entry>.

3. Assume, for the sake of contradiction, that there is a process, q, that does not read
some single-writer of another process p. The linearizations of Theorem 3 apply. ■



Corollary 3. The following CSP algorithms do not solve CSP(n) for PC-G, even if 
n = 2: 

1. Dijkstra’s Algorithm

2. Dekker’s Algorithm

3. De Bruijn’s Algorithm

4. Knuth’s Algorithm

5. Eisenberg and MacGuire’s Algorithm

Proof: First, note that all these algorithms (reproduced in Appendix A) use n single-writers
and one multi-writer.

In Dijkstra’s Algorithm, if the multi-writer turn is initially p, p enters its <critical
section> without writing to the multi-writer. In Dekker’s and De Bruijn’s algorithms, the
multi-writer is only written in <exit>. In Knuth’s and in Eisenberg and MacGuire’s
algorithms, the multi-writer is only written as the last step in <entry>. By Lemma
2(2), all of these algorithms are incorrect for PC-G. ■



Theorem 5. Burns’ Algorithm is an unfair CSP(n) solution for PC-G.

Proof: Mutual Exclusion: Assume for the sake of contradiction that there exists
some PC-G computation of Burns’ Algorithm where two processes, say i and j, execute
in their <critical section> concurrently. Then, i (respectively, j) must read flag[j]
(respectively, flag[i]) to be false at line 11 before entering its <critical section>, as
shown by the following computation.



Computation 4
    i: ... r(flag[j])false   <critical section>
    j: ... r(flag[i])false   <critical section>

flag[0..n−1] in {true, false}
turn in {0, ..., n−1}

<entry>
 1   flag[i] ← true
 2   turn ← i
 3   repeat
 4     while (turn ≠ i) do
 5       flag[i] ← false
 6       if (∀j ≠ i: not flag[j]) then
 7         flag[i] ← true
 8         turn ← i
 9       end-if
10     end-while
11   until (∀j ≠ i: not flag[j])
<critical section>
<exit>
12   flag[i] ← false

Processes have unique identifiers from the set {0, ..., n−1}, where n is the total number of processes. The
algorithm is given by specifying the <entry> and <exit> sections of process i, i ∈ {0, ..., n−1}.

Fig. 2. Burns’ unfair CSP solution



Note that when a process, say i, executes a w(flag[i])true, the next operation it
executes is a w(turn)i. Let w(turn)i be the last write operation to turn that i executes
before entering its <critical section>. (This write could be performed at line 2 or 8.)
Similarly, let w(turn)j be the last write to turn that j executes before entering its <critical
section>.

Since Computation 4 satisfies PC-G, the two linearizations $(O|i \cup O|w, \xrightarrow{L_i})$ and
$(O|j \cup O|w, \xrightarrow{L_j})$ must exist such that both agree on the order of writes to turn. Without
loss of generality, suppose w(turn)i precedes w(turn)j in both linearizations. Since
$w(turn)j \xrightarrow{L_j} r(flag[i])false$ (by program order), $w(turn)i \xrightarrow{L_j} r(flag[i])false$.
There must be some write w(flag[i])true, such that this write is the last write by i that
precedes w(turn)i in j’s view. Since w(turn)i is the last write by i before it enters its
<critical section>, w(flag[i])true must be the last write to flag[i] before i enters its
<critical section>. By transitivity, this write is the most recent write to flag[i] that
precedes r(flag[i])false in j’s view, contradicting the validity of $(O|j \cup O|w, \xrightarrow{L_j})$.
Therefore, Burns’ algorithm satisfies Mutual Exclusion for PC-G.

Progress: If only one process is participating, then it will enter the <critical
section>. So assume m processes, 2 ≤ m ≤ n, are participating in a computation of
Burns’ Algorithm such that none of them is able to progress to <critical section>. We
show this is impossible. By PC-G, all processes must agree on the order of the writes to
turn, and eventually m − 1 of them will see turn different from their own identifiers;
therefore, all m − 1 processes enter the body of the while loop. At least one process will
fail the test on line 4, skipping the while loop. This is because of the total order on the
writes to turn that all processes agree on. Since there is at least one process, say j, that
does not engage in the while loop, we must have the following, where i ≠ j:

$w(turn)i \xrightarrow{L_j} w(turn)j \xrightarrow{L_j} r(turn)j$.

Since w(flag[j])true precedes w(turn)j in program order, we conclude:

$w(flag[j])true \xrightarrow{L_i} r(flag[j])true$.

Therefore, lines 7 and 8 are unreachable for i unless j makes progress to <exit>. So,
i is repeatedly executing lines 4 and 5, and w(flag[i])false of line 5 must eventually
appear in $(O|j \cup O|w, \xrightarrow{L_j})$, and consequently j enters its <critical section>.

Fairness: To see that Burns’ algorithm is unfair for PC-G, we show it is unfair
even for SC.² Consider Computation 5, which represents a starvation scenario, where
the segments enclosed in square brackets can be repeated indefinitely.

Computation 5
    i: [w(flag[i])true w(turn)i r(turn)i r(flag[j])false <critical section> w(flag[i])false]
    j: w(flag[j])true w(turn)j [r(turn)i w(flag[j])false r(flag[i])true]

The following is an SC linearization:

$(O, \xrightarrow{L})$ = w_j(flag[j])true w_j(turn)j
[w_i(flag[i])true w_i(turn)i r_i(turn)i r_j(turn)i w_j(flag[j])false r_j(flag[i])true
r_i(flag[j])false <critical section> w_i(flag[i])false].

Operations are subscripted by the corresponding process id. The segment enclosed in
square brackets is the part of the computation being repeated indefinitely.
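
As a sanity check on the algorithm of Fig. 2, the following sketch exhaustively explores every interleaving of two processes. It rests on several assumptions that go beyond the paper: memory is modeled as sequentially consistent (each read or write is one atomic step), only n = 2 is covered, and the garbled tests on lines 6 and 11 are read as "not flag[j]" for the single other process j. It therefore says nothing about PC-G itself; it only confirms that this reading of the figure is mutually exclusive under SC.

```python
from collections import deque

CS = 12  # program-counter value meaning "inside <critical section>"

def step(state, i):
    """Execute one atomic read or write of process i; return the next state."""
    pcs, flags, turn = list(state[0]), list(state[1]), state[2]
    j = 1 - i
    pc = pcs[i]
    if pc == 1:                      # line 1: flag[i] <- true
        flags[i] = True
        pcs[i] = 2
    elif pc == 2:                    # line 2: turn <- i
        turn = i
        pcs[i] = 4
    elif pc == 4:                    # line 4: while (turn != i)
        pcs[i] = 5 if turn != i else 11
    elif pc == 5:                    # line 5: flag[i] <- false
        flags[i] = False
        pcs[i] = 6
    elif pc == 6:                    # line 6: if not flag[j]
        pcs[i] = 7 if not flags[j] else 4
    elif pc == 7:                    # line 7: flag[i] <- true
        flags[i] = True
        pcs[i] = 8
    elif pc == 8:                    # line 8: turn <- i
        turn = i
        pcs[i] = 4
    elif pc == 11:                   # line 11: until not flag[j]
        pcs[i] = CS if not flags[j] else 4
    elif pc == CS:                   # line 12 (<exit>), then start over
        flags[i] = False
        pcs[i] = 1
    return (tuple(pcs), tuple(flags), turn)

def reachable_states():
    """Breadth-first search over all interleavings of the two processes."""
    start = ((1, 1), (False, False), 0)
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for i in (0, 1):
            t = step(s, i)
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

states = reachable_states()
violations = [s for s in states if s[0] == (CS, CS)]
```

An empty `violations` list means no reachable state has both processes in their critical sections in this model.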



5 Summary 

PC-G is a consistency model that satisfies both Pipelined-RAM consistency and Coher- 
ence. Furthermore, for each process, there must be a single linearization that meets both 
requirements simultaneously. Even the slight relaxation to a consistency model that is 
the intersection of both Pipelined-RAM and Coherence (but which permits distinct lin- 
earizations for each requirement) is too weak to support a solution to CSP without using 
stronger objects than simple variables (even unbounded ones). This can be proved with 
techniques similar to ones used here [7]. Thus, PC-G appears to be the weakest mem- 
ory consistency model in the literature that has a solution to CSP using only reads and 
writes to shared variables. 

Any solution to CSP(n) for PC-G must use at least one multi-writer and n single-writers.
Burns’ algorithm, which uses one multi-writer and n single-writers and is correct
for PC-G, establishes that this bound is tight. But Burns’ algorithm is unfair. Peterson’s
algorithm for two processes, which uses one multi-writer and 2 single-writers
and is correct and fair for PC-G, shows that this lower bound is tight even for fair solutions
when n = 2. It is not yet clear to us whether a fair solution for n processes can be
constructed using only one multi-writer and n single-writers. If not, then to tighten the
lower bound in the general case, impossibility proofs will have to exploit fairness. Many
other algorithms that use the same number and type of variables as Burns’ have been
shown to fail for PC-G. Finally, Peterson’s algorithm, which uses n − 1 multi-writers
and n single-writers, is correct and fair for PC-G.

² It is well known that Burns’ algorithm is unfair even for SC.
A CSP Algorithms

For each of the following CSP algorithms, processes have unique identifiers from the
set {0, ..., n−1}, where n is the total number of processes. The algorithms are given
by specifying the <entry> and <exit> sections of process i, i ∈ {0, ..., n−1}.

Peterson’s Algorithm

flag[0..n−1] in {−1, ..., n−2}
turn[0..n−2] in {0, ..., n−1}

<entry>
for k := 0 to n−2 do
    flag[i] ← k
    turn[k] ← i
    while (∃j ≠ i: flag[j] ≥ k) and turn[k] = i do nothing
<critical section>
<exit>
flag[i] ← −1

Dijkstra’s Algorithm

flag[0..n−1] in {idle, requesting, in-cs}
turn in {0, ..., n−1}

<entry>
repeat
    flag[i] ← requesting
    while (turn ≠ i) do
        if (flag[turn] = idle) then
            turn ← i
    end-while
    flag[i] ← in-cs
until (∀j ≠ i: flag[j] ≠ in-cs)
<critical section>
<exit>
flag[i] ← idle

Dekker’s Algorithm (2 processes)

flag[0..1] in {true, false}
turn in {0, 1}

<entry>
flag[i] ← true
while (flag[j]) do
    if (turn = j) then
        flag[i] ← false
        while (turn = j) do nothing
        flag[i] ← true
    end-if
end-while
<critical section>
<exit>
turn ← j
flag[i] ← false

Eisenberg and MacGuire’s Algorithm

flag[0..n−1] in {idle, requesting, in-cs}
turn in {0, ..., n−1}

<entry>
repeat
    flag[i] ← requesting
    j ← turn
    while (j ≠ i) do
        if (flag[j] ≠ idle) then
            j ← turn
        else j ← (j+1) mod n
    end-while
    flag[i] ← in-cs
until ((∀j ≠ i: flag[j] ≠ in-cs) and
       (turn = i or flag[turn] = idle))
turn ← i
<critical section>
<exit>
j ← (turn+1) mod n
while (j ≠ turn and flag[j] = idle) do
    j ← (j+1) mod n
end-while
turn ← j
flag[i] ← idle



De Bruijn’s Algorithm

flag[0..n−1] in {idle, requesting, in-cs}
turn in {0, ..., n−1}

<entry>
repeat
    flag[i] ← requesting
    j ← turn
    while (j ≠ i) do
        if (flag[j] ≠ idle) then
            j ← turn
        else j ← (j−1) mod n
    end-while
    flag[i] ← in-cs
until (∀j ≠ i: flag[j] ≠ in-cs)
<critical section>
<exit>
if (flag[turn] = idle and turn = i) then
    turn ← (turn−1) mod n
end-if
flag[i] ← idle



Knuth’s Algorithm

flag[0..n−1] in {idle, requesting, in-cs}
turn in {0, ..., n−1}

<entry>
repeat
    flag[i] ← requesting
    j ← turn
    while (j ≠ i) do
        if (flag[j] ≠ idle) then
            j ← turn
        else j ← (j−1) mod n
    end-while
    flag[i] ← in-cs
until (∀j ≠ i: flag[j] ≠ in-cs)
turn ← i
<critical section>
<exit>
turn ← (i−1) mod n
flag[i] ← idle
Abstract. The computer industry is examining the use of strong synchronization
operations such as double compare-and-swap (DCAS) as a
means of supporting non-blocking synchronization on tomorrow’s mul- 
tiprocessor machines. However, before such a primitive will be incorpo- 
rated into hardware design, its utility needs to be proven by developing 
a body of effective non-blocking data structures using DCAS. 

In a previous paper, we presented two linearizable non-blocking imple- 
mentations of concurrent deques (double-ended queues) using the DCAS 
operation. These improved on previous algorithms by nearly always al- 
lowing unimpeded concurrent access to both ends of the deque while 
correctly handling the difficult boundary cases when the deque is empty 
or full. A remaining open question was whether, using DCAS, one can 
design a non-blocking implementation of concurrent deques that allows 
dynamic memory allocation but also uses only a single DCAS per push 
or pop in the best case. 

This paper answers that question in the affirmative. We present a new 
non-blocking implementation of concurrent deques using the DCAS op- 
eration. This algorithm provides the benefits of our previous techniques 
while overcoming drawbacks. Like our previous approaches, this imple- 
mentation relies on automatic storage reclamation to ensure that a stor- 
age node is not reclaimed and reused until it can be proved that the 
node is not reachable from any thread of control. This algorithm uses 
a linked-list representation with dynamic node allocation and therefore 
does not impose a fixed maximum capacity on the deque. It does not 
require the use of a “spare bit” in pointers. In the best case (no interfer- 
ence), it requires only one DCAS per push and one DCAS per pop. We 
also sketch a proof of correctness. 



1 Introduction 

In academic circles and in industry, it is becoming evident that non-blocking
algorithms can deliver significant performance benefits [3, 20, 17] and resiliency
benefits [9] to parallel systems. Unfortunately, there is a growing realization
that existing synchronization operations on single memory locations, such as
compare-and-swap (CAS), are not expressive enough to support design of efficient
non-blocking algorithms [9, 10, 12], and software emulations of stronger
primitives from weaker ones are still too complex to be considered practical [1,
4, 7, 8, 21]. In response, industry is currently examining the idea of supporting
stronger synchronization operations in hardware. A leading candidate among
such operations is double compare-and-swap (DCAS), a CAS performed atomically
on two memory locations. However, before such a primitive can be incorporated
into processor design, it is necessary to understand how much of an
improvement it actually offers. One step in doing so is developing a body of efficient
data structures and associated algorithms based on the DCAS operation.

                                   Array with    Array used    Linked list   Snark (with
                                   centralized   as circular   with tagged   garbage
                                   access        buffer        pointers      collection)
                                   (see [9])     (see [2])     (see [2])     (this paper)
Left and right accesses interfere  yes           no            no            no
Fixed limit on size of deque       yes           yes           no            no
Tag bit needed in pointers         no            no            yes           no
DCAS ops per unimpeded pop         1             1             2             1
DCAS ops per unimpeded push        1             1             1             1
Number of reserved values          1             1             3             0
Storage allocator calls per push   0             0             1             1
Storage overhead per item          none          none          2 pointers    2 pointers

Table 1. Comparison of various DCAS-based deque algorithms

There have recently been several proposed designs for non-blocking linearizable
concurrent double-ended queues (deques) using the double compare-and-swap
operation [9, 2]. Deques, as described in [15] and currently used in load
balancing algorithms [3], are classic structures to examine, in that they involve
all the intricacies of LIFO-stacks and FIFO-queues, with the added complexity
of handling operations originating at both ends of the deque.

Massalin and Pu [16] were the first to present a collection of DCAS-based con- 
current algorithms. They built a lock-free operating system kernel based on the 
DCAS operation (CAS2) offered by the Motorola 68040 processor, implementing 
structures such as stacks, FIFO-queues, and linked lists. 

Greenwald, a strong advocate for using DCAS, built a collection of DCAS- 
based concurrent data structures improving on those of Massalin and Pu. In the 
best case (no interference from other threads) , his array-based deque algorithms 
required one DCAS per push and one DCAS per pop. Unfortunately, these al- 
gorithms used DCAS in a restrictive way. The first ([9] pp. 196-197) used the 
two-word DCAS as if it were a three-word operation, keeping the two deque 
end pointers in the same memory word, and DCAS-ing on it and a second word 
containing a value; this prevents truly concurrent, noninterfering access to the 
two deque ends. The second algorithm ([9] pp. 219-220) assumed an array of 
unbounded size, and did not correctly detect when the deque is full in all cases. 

Arora et al. [3] present an elegant CAS-based restricted deque with applications
in job-stealing algorithms. This non-blocking implementation needs only a
single CAS operation since it restricts one side of the deque to be accessed by
only a single processor, and the other side to allow only pop operations.

In a recent paper [2], we presented two new linearizable non-blocking implementations
of concurrent deques using the DCAS operation. One used an array
representation, and improved on previous algorithms by allowing uninterrupted
concurrent access to both ends of the deque while correctly handling the difficult
boundary cases when the deque is empty or full. In the best case, this array
technique required one DCAS per push and one DCAS per pop. A drawback of
the array representation was that it imposed a fixed maximum capacity on the
deque. The second implementation corrected this by using a dynamic linked-list
representation, and was the first non-blocking unbounded-memory deque implementation.
Drawbacks of this list-based implementation were that it required a
“spare bit” in certain pointers to serve as a boolean flag and that it required at
least two (amortized) DCAS operations per pop.

A remaining open question was whether, using DCAS, one can design a 
non-blocking implementation of concurrent deques that allows dynamic memory 
allocation, as in the linked-list algorithms of [2], but also uses only a single DCAS 
per push or pop in the best case, as in array-based algorithms [2, 9]. This paper 
answers that question in the affirmative. Table 1 outlines the characteristics of 
the various algorithms. The first six rows indicate that the algorithm presented 
in this paper avoids drawbacks of previous work. 

2 Modeling DCAS and Deques 

Our computation model follows [5, 6, 14] as well as our own previous paper [2]. 
A concurrent system is a collection of n processors^ which communicate through 
shared data structures called objects. Each object provides a set of primitive 
operations that are the only means of manipulating that object. Each processor 
is a thread of control [14] that sequentially invokes object operations by issuing 
an invocation and then receiving the associated response before issuing the next 
invocation. A thread behavior is the entire set of invocations and associated 
responses associated with a single thread; this set is totally ordered in time 
according to the order in which the thread issued and received the invocations 
and responses. A system behavior is the (disjoint) union of the thread behaviors 
of all the threads in a concurrent system. 

A history is a system behavior upon which a total order has been imposed 
on invocations and responses that is consistent with the orderings of the thread 
behaviors. Each history may be regarded as a “real-time” order of operations 
where an operation A is said to precede another operation B if AA response occurs 
before BA invocation. Two operations are concurrent if they are unrelated by 
the real-time order. When we reason about the possible behaviors of a system or 
a thread within that system, we typically try to characterize the set of possible 
histories of the system. 
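
The real-time order just defined can be illustrated with a minimal sketch. It assumes, purely for illustration, that each operation is recorded with the integer positions of its invocation and response in the history's total order; the names Op, precedes, and concurrent are hypothetical and do not appear in the model above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operation in a history (hypothetical record type)."""
    name: str
    invoked: int    # position of the invocation in the total order
    responded: int  # position of the matching response


def precedes(a: Op, b: Op) -> bool:
    """A precedes B if A's response occurs before B's invocation."""
    return a.responded < b.invoked


def concurrent(a: Op, b: Op) -> bool:
    """Two operations are concurrent if neither precedes the other."""
    return not precedes(a, b) and not precedes(b, a)


# Three operations from an imagined history.
push = Op("pushRight(1)", invoked=0, responded=3)
pop = Op("popLeft()", invoked=2, responded=5)
later = Op("popRight()", invoked=6, responded=7)
```

Here push and pop overlap in time and are therefore concurrent, while both precede later.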

A sequential history is a history in which each invocation is followed immedi- 
ately by its associated response. The sequential specification of an object is a set 
of permitted sequential histories. The basic correctness requirement for a con- 
current implementation is linearizability [14]: for every history H that may be 
realized by the system, there exists a sequential history that is in the intersection
of the sequential specifications of all the objects in the system and whose total
order of operations is consistent with H’s partial order of operations. In a
linearizable implementation, each operation appears to take effect atomically at 
some point between its invocation and its associated response. 

In our model, every shared memory location L of a multiprocessor machine’s
memory is a linearizable implementation of an object that provides every
processor P_i with a set of sequentially specified machine operations (see [11, 13]):

Read_i(&L) reads location L and returns its value.

Write_i(&L, v) writes the value v to location L.

DCAS_i(&L1, &L2, o1, o2, n1, n2) is a double-compare-and-swap operation with
the semantics described below.

(The address operator & is used to pass the address of a location to an operation.)
Because we assume a linearizable implementation, we can, in effect, assume that 
these operations are atomic when reasoning about programs that use them. 

For the purposes of this paper, when we write code in a high-level language, 
we assume that each field of a high-level-language object and each global variable
may be treated as a shared memory location. A simple reference to such a field 
or variable is a Read operation; a simple assignment to such a field or variable is 
a Write operation; and a method or subroutine called DCAS is used to perform 
the DCAS operation on two fields or variables. 

The implementation we present is non-blocking (also called lock-free) [13]. Let
us use the term higher-level operations to refer to operations of an object being
implemented, and lower-level operations to refer to the (machine) operations in
terms of which it is implemented. A non-blocking implementation is one for which
any history that has invocations of some set O of higher-level operations but no
associated responses may contain any number of responses for higher-level
operations concurrent with those in O. That is, even if some higher-level operations
(each of which may be continuously taking steps, or not) never complete, other
invoked operations may nevertheless continually complete. Thus the system as a
whole can make progress; individual processors cannot be blocked, only delayed,
by other processors continuously taking steps or failing to take steps. Using locks
would violate the above condition, hence the alternate name lock-free.

Figure 1 contains code for the DCAS operation; for comparison, it also shows 
code for the simpler CAS operation (which is not used in the algorithms pre- 
sented here). For either operation, the sequence of suboperations is assumed to 
be executed atomically, either through hardware support [12, 18, 19] or through 
a non-blocking software emulation [7, 21]. 

A CAS operation examines one memory location and compares its contents 
to an expected “old” value. If the contents match, then the contents are replaced 
with a specified “new” value and an indication of success is returned; otherwise 
the contents are unchanged and an indication of failure is returned. 

A DCAS operation may be viewed as two yoked CAS operations: mismatch 
in either causes both to fail. (Note: the algorithms in this paper do not require 
the overloaded versions of DCAS that we used in our previous paper [2].) 







boolean CAS(val *addr,
            val old,
            val new) {
  atomically {
    if (*addr == old) {
      *addr = new;
      return true;
    } else return false;
  }
}

boolean DCAS(val *addr1, val *addr2,
             val old1, val old2,
             val new1, val new2) {
  atomically {
    if ((*addr1 == old1) &&
        (*addr2 == old2)) {
      *addr1 = new1;
      *addr2 = new2;
      return true;
    } else return false;
  }
}



Fig. 1. Single and Double Compare-and-Swap Operations
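
For readers who want to experiment, the semantics of CAS and DCAS can be mimicked in Python. This is only a sketch under a loud assumption: a single global lock stands in for the atomicity that hardware or a non-blocking emulation would provide, so this emulation is itself blocking and is not what the paper relies on. The names Cell, cas, and dcas are hypothetical.

```python
import threading


class Cell:
    """A shared memory location holding one value (hypothetical helper)."""

    def __init__(self, value=None):
        self.value = value


# Assumption: one global lock stands in for the "atomically" block.
_atomic = threading.Lock()


def cas(cell, old, new):
    """Replace cell.value by new only if it currently equals old."""
    with _atomic:
        if cell.value == old:
            cell.value = new
            return True
        return False


def dcas(cell1, cell2, old1, old2, new1, new2):
    """Two yoked CAS operations: a mismatch in either makes both fail."""
    with _atomic:
        if cell1.value == old1 and cell2.value == old2:
            cell1.value = new1
            cell2.value = new2
            return True
        return False


x, y = Cell(1), Cell(2)
first = dcas(x, y, 1, 2, 10, 20)    # both locations match
second = dcas(x, y, 10, 2, 0, 0)    # second location no longer holds 2
```

The second call leaves both cells untouched, which is the all-or-nothing behavior the deque algorithms depend on.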



We assume that a CAS operation is substantially more expensive than a 
simple read or write of a shared variable, and that a DCAS is rather more 
expensive than a CAS. We also assume that memory operations (Read, Write, 
DCAS) that operate on distinct locations can be carried out concurrently, but 
those that operate on the same location are carried out sequentially, so there is a 
potential performance advantage in, for example, avoiding having operations on 
one end of a deque touch variables associated with the other end of the deque. 

A deque S is a concurrent shared object created by a makeDeque(length)
operation that allows each processor to perform one of four types of operations
on S: pushRight, popRight, pushLeft, and popLeft.

We require that a concurrent implementation of a deque object be one that 
is linearizable to a standard sequential deque of the type described in [15]. 

The state of a deque is a sequence of items S = (v_0, ..., v_k) having cardinality
|S| where 0 ≤ |S| ≤ length. A deque is initially empty, that is, has cardinality
0. A deque is said to be full when its cardinality is length. (For the purposes 
of this paper, the length of the deque is essentially the total amount of storage 
available for allocation as deque node objects.) 

The four possible push and pop operations induce the following state transitions
of the sequence S = (v_0, ..., v_k), with appropriate returned values:

- pushRight(v_new), if S is not full, changes S to be (v_0, ..., v_k, v_new) and
returns “okay”; if S is full, it returns “full” and S is unchanged.

- pushLeft(v_new), if S is not full, changes S to be (v_new, v_0, ..., v_k) and
returns “okay”; if S is full, it returns “full” and S is unchanged.

- popRight(), if S is not empty, changes S to be (v_0, ..., v_{k-1}) and returns
v_k; if S is empty, it returns “empty” and S is unchanged.

- popLeft(), if S is not empty, changes S to be (v_1, ..., v_k) and returns v_0; if
S is empty, it returns “empty” and S is unchanged.

For example, starting with an empty deque S = (), pushRight(1) changes
the state to S = (1); pushLeft(2) transitions to S = (2, 1); then pushRight(3)
transitions to S = (2, 1, 3). A subsequent popLeft() transitions to S = (1, 3)
and returns 2; then popLeft() transitions to S = (3) and returns 1 (which had
been pushed from the right).
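
The sequential specification above can be written down directly as a small, purely sequential reference model. The sketch below is an illustration only: the class name SeqDeque and the use of a Python list are assumptions, and nothing here is concurrent.

```python
class SeqDeque:
    """Sequential reference model of the deque specification (a sketch)."""

    def __init__(self, length):
        self.length = length   # capacity, as in makeDeque(length)
        self.items = []        # the state S = (v_0, ..., v_k)

    def push_right(self, v):
        if len(self.items) == self.length:
            return "full"
        self.items.append(v)
        return "okay"

    def push_left(self, v):
        if len(self.items) == self.length:
            return "full"
        self.items.insert(0, v)
        return "okay"

    def pop_right(self):
        if not self.items:
            return "empty"
        return self.items.pop()

    def pop_left(self):
        if not self.items:
            return "empty"
        return self.items.pop(0)


# Replay the example from the text on a deque of capacity 3.
s = SeqDeque(3)
s.push_right(1)
s.push_left(2)
s.push_right(3)
overflow = s.push_right(4)   # the deque already holds 3 items
a = s.pop_left()
b = s.pop_left()
```

A concurrent implementation is correct if every history it admits can be linearized to a run of a model like this one.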
3 The “Snark” Linked-list Deque



Our implementation (we have arbitrarily nicknamed it Snark) represents a deque 
as a doubly-linked list of nodes. Each node in the list contains two link pointers 
R and L and a value V (see Figure 2 below). There are two global “anchor” 
variables, arbitrarily called LeftHat and RightHat (lines 7-8), which generally 
point to the leftmost node and the rightmost node in the chain. 

A node whose L field points to that same node is said to be left-dead; a node
whose R field points to that same node is said to be right-dead. If LeftHat points 
to a node that is not left-dead, then the L field of that node points to a right- 
dead node; if RightHat points to a node that is not right-dead, then the R field 
of that node points to a left-dead node. As we will see, LeftHat points to a left- 
dead node if and only if RightHat points to a right-dead node; such a situation 
represents a deque with no items in it. The special node Dummy is both left-dead 
and right-dead (lines 5-6); as we will see, no other node is ever both left-dead
and right-dead. In all cases, once a node becomes left-dead, it remains left-dead 
(until the node is determined to be inaccessible and therefore eligible to be 
reclaimed); once a node becomes right-dead, it remains right-dead. These rules 
may seem somewhat complicated, but they lead to a uniform implementation of 
pop operations. 

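A typical deque, with values A, B, C, and D in it, looks like this:

[Figure: a doubly-linked chain of four nodes holding A, B, C, and D; LeftHat points to the leftmost node (A) and RightHat points to the rightmost node (D).]

where ? indicates a “don’t care” pointer or value. An empty deque looks like:

[Figure: LeftHat points to a left-dead node and RightHat points to a right-dead node.]

a special case of which is

[Figure: LeftHat and RightHat both point to the Dummy node, which is both left-dead and right-dead.]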

Figures 3 and 4 show non-blocking implementations of push and pop opera- 
tions on the right-hand end of the deque. We describe these operations in detail. 
The left-hand-side operations shown in Figures 5 and 6 are symmetric.

The right-side push operation first obtains a fresh Node structure from the 
storage allocator (Figure 3, line 2). (Note that the problem of implementing a 



1 structure Node {
2   Node *R;
3   Node *L;
4   val V; }

5 Node Dummy = new Node();
6 Dummy.L = Dummy.R = Dummy;
7 Node *LeftHat = Dummy;
8 Node *RightHat = Dummy;



Fig. 2. The linked-list deque: data structure and hats (anchors).
 1 val pushRight(val v) {
 2   nd = new Node();                 /* Allocate new Node structure */
 3   if (nd == null) return "full";
 4   nd->R = Dummy;
 5   nd->V = v;
 6   while (true) {
 7     rh = RightHat;                 /* Labels A, B,   */
 8     rhR = rh->R;                   /* etc., are used */
 9     if (rhR == rh) {               /* in the proof   */
10       nd->L = Dummy;               /* of correctness */
11       lh = LeftHat;
12       if (DCAS(&RightHat, &LeftHat, rh, lh, nd, nd))    /* A */
13         return "okay";
14     } else {
15       nd->L = rh;
16       if (DCAS(&RightHat, &rh->R, rh, rhR, nd, nd))     /* B */
17         return "okay";
18 } } }  // Please forgive this brace style

Fig. 3. Simple linked-list deque: right-hand-side push.



non-blocking storage allocator is not addressed in this paper, but would need 
to be solved to produce a completely non-blocking deque implementation.) We 
assume that if allocatable storage has been completely exhausted (even after 
automatic reclamation has occurred), the new operation will yield a null pointer; 
the push operation treats this as sufficient cause to report that the deque is full 
(line 3). Otherwise, the R field of the new node is made to point to Dummy (line 4) 
and the value to be pushed is stored into the V field (line 5); all that remains is 
to splice this new node into the doubly-linked chain. But an attempt to splice 
might fail (because of an action by some other concurrent push or pop), so a 
“while true” loop (line 6) is used to iterate until a splice succeeds. 

The RightHat is copied into local variable rh (line 7); this is important. If
rh points to a right-dead node (line 9), then the deque is empty. In this case,
the new node should become the only node in the deque. Its L field is made to
point to Dummy (line 10) and then a DCAS is used (line 12) to atomically make
both RightHat and LeftHat point to the new node, but only if neither hat has
changed. If this DCAS succeeds, then the push has succeeded (line 13); if the
DCAS fails, then control will go around the “while true” loop to retry.

If the deque is not empty, then the new node must be added to the right-hand 
end of the doubly-linked chain. The copied content of the RightHat is stored 
into the L field of the new node (line 15) and then a DCAS is used (line 16) to 
make both the RightHat and the former right-end node point to the new node, 
which thus becomes the new right-end node. If this DCAS operation succeeds, 
then the push has succeeded (line 17); if the DCAS fails, then control will go 
around the “while true” loop to retry. 

The right-side pop operation also uses a “while true” loop (line 2) to iterate 
until an attempt to pop succeeds. The RightHat is copied into local variable rh 
 1 val popRight() {
 2   while (true) {
 3     rh = RightHat;                      // Delicate order of operations
 4     lh = LeftHat;                       // here (see proof of Theorem 4
 5     if (rh->R == rh) return "empty";    // and the Conclusions section)
 6     if (rh == lh) {
 7       if (DCAS(&RightHat, &LeftHat, rh, lh, Dummy, Dummy))   /* C */
 8         return rh->V;
 9     } else {
10       rhL = rh->L;
11       if (DCAS(&RightHat, &rh->L, rh, rhL, rhL, rh)) {       /* D */
12         result = rh->V;
13         rh->R = Dummy;                                        /* E */
14         rh->V = null;                   /* optional (see text) */
15         return result;
16 } } } }  // Stacking braces this way saves space

Fig. 4. Simple linked-list deque: right-hand-side pop.



(line 3); this is important. If rh points to a right-dead node, then the deque is
empty and the pop operation reports that fact (line 5).

Otherwise, there are two cases, depending on whether there is exactly one 
item or more than one item in the deque. There is exactly one item in the 
deque if and only if the LeftHat and RightHat point to the same node (line 6). 
In that case, a DCAS operation is used to reset both hats to point to Dummy 
(line 7); if it succeeds, then the pop succeeds and the value to be returned is 
in the V field of the popped node (line 8). (It is assumed that, after exit from 
the popRight routine, the node just popped will be reclaimed by the automatic 
storage allocator, through garbage collection or some such technique.) 

If there is more than one item in the deque, then the rightmost node must 
be removed from the doubly-linked chain. A DCAS is used (line 11) to move the
RightHat to the node to the immediate left of the rightmost node; at the same 
time, the L field of that rightmost node is changed to contain a self-pointer, thus 
making the rightmost node left-dead. If this DCAS operation fails, then control 
will go around the “while true” loop to retry; but if the DCAS succeeds, then the 
pop succeeds and the value to be returned is in the V field of the popped node. 
Before this value is returned, the R field is cleared (line 13) so that previously 
popped nodes may be reclaimed. It may also be desirable to clear the V field 
immediately (line 14) so that the popped value will not be retained indefinitely 
by the queue structure. If the V field does not contain references to other data 
structures, then line 14 may be omitted. 
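
The right-hand-side push and pop described above can be transliterated into Python for experimentation. This is a sketch under stated assumptions, not the paper's implementation: DCAS is emulated with a global lock (so the sketch is not actually non-blocking), memory locations are addressed as (object, field name) pairs, pointer comparison is modeled with Python's `is`, and the "full" case is omitted because Python's allocator does not return null. The names Snark, push_right, pop_right, and dcas are hypothetical.

```python
import threading

_atomic = threading.Lock()  # assumption: a lock stands in for atomic DCAS


def dcas(o1, f1, o2, f2, old1, old2, new1, new2):
    """DCAS on locations o1.f1 and o2.f2, comparing pointers by identity."""
    with _atomic:
        if getattr(o1, f1) is old1 and getattr(o2, f2) is old2:
            setattr(o1, f1, new1)
            setattr(o2, f2, new2)
            return True
        return False


class Node:
    def __init__(self):
        self.L = self.R = self.V = None


class Snark:
    def __init__(self):
        self.Dummy = Node()
        self.Dummy.L = self.Dummy.R = self.Dummy  # left-dead and right-dead
        self.LeftHat = self.RightHat = self.Dummy

    def push_right(self, v):
        nd = Node()
        nd.R = self.Dummy
        nd.V = v
        while True:
            rh = self.RightHat
            rhR = rh.R
            if rhR is rh:  # rh is right-dead, so the deque is empty
                nd.L = self.Dummy
                lh = self.LeftHat
                if dcas(self, "RightHat", self, "LeftHat",
                        rh, lh, nd, nd):            # operation A
                    return "okay"
            else:
                nd.L = rh
                if dcas(self, "RightHat", rh, "R",
                        rh, rhR, nd, nd):           # operation B
                    return "okay"

    def pop_right(self):
        while True:
            rh = self.RightHat  # both hats are read before the empty test
            lh = self.LeftHat
            if rh.R is rh:
                return "empty"
            if rh is lh:        # exactly one item
                if dcas(self, "RightHat", self, "LeftHat",
                        rh, lh, self.Dummy, self.Dummy):  # operation C
                    return rh.V
            else:
                rhL = rh.L
                if dcas(self, "RightHat", rh, "L",
                        rh, rhL, rhL, rh):          # operation D
                    result = rh.V
                    rh.R = self.Dummy               # operation E
                    rh.V = None
                    return result


d = Snark()
for item in (1, 2, 3):
    d.push_right(item)
popped = [d.pop_right() for _ in range(4)]
```

Used from a single thread, the sketch behaves like a stack on the right-hand end; the left-hand operations would follow by exchanging L with R and LeftHat with RightHat.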

The push and pop operations work together in a completely straightforward 
manner except in one odd case. If a popRight operation and a popLeft operation
occur concurrently when there are exactly two nodes in the deque, then each 
operation may (correctly) discover that LeftHat and RightHat point to different 
nodes (line 6 in each of Figures 4 and 6) and therefore proceed to perform a 
DCAS for the multinode case (line 11 in each of Figures 4 and 6). Both of these 
 1 val pushLeft(val v) {
 2   nd = new Node();                 /* Allocate new Node structure */
 3   if (nd == null) return "full";
 4   nd->L = Dummy;
 5   nd->V = v;
 6   while (true) {
 7     lh = LeftHat;
 8     lhL = lh->L;
 9     if (lhL == lh) {
10       nd->R = Dummy;
11       rh = RightHat;
12       if (DCAS(&LeftHat, &RightHat, lh, rh, nd, nd))    /* A' */
13         return "okay";
14     } else {
15       nd->R = lh;
16       if (DCAS(&LeftHat, &lh->L, lh, lhL, nd, nd))      /* B' */
17         return "okay";
18 } } }  // We were given a firm limit of 15 pages

Fig. 5. Simple linked-list deque: left-hand-side push.



DCAS operations may succeed, because they operate on disjoint pairs of memory 
locations. The result is that the hats pass each other: 



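[Figure: before the two pops, LeftHat points to the node holding A and RightHat points to the node holding B; afterwards the hats have crossed, so that LeftHat points to the node that held B (now left-dead) and RightHat points to the node that held A (now right-dead).]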



But this works out just fine: there had been two nodes in the deque and both have 
been popped, but as they are popped they are made right-dead and left-dead, 
so that the deque is now correctly empty. 
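
The crossing of the hats can be replayed step by step. The sketch below is single-threaded and only simulates one particular interleaving: both pops first read the hats and their neighbor pointers, and only then perform their DCAS operations. The dcas helper here is a plain, non-atomic stand-in, which is adequate only because the interleaving is scripted; all names are hypothetical.

```python
class Node:
    def __init__(self, V=None):
        self.L = self.R = None
        self.V = V


class Hats:
    """Holder for the two global anchor variables."""


def dcas(o1, f1, o2, f2, old1, old2, new1, new2):
    # Non-atomic stand-in; safe here because the steps are run in sequence.
    if getattr(o1, f1) is old1 and getattr(o2, f2) is old2:
        setattr(o1, f1, new1)
        setattr(o2, f2, new2)
        return True
    return False


# Build a deque holding exactly two items, A and B.
dummy = Node()
dummy.L = dummy.R = dummy
a, b = Node("A"), Node("B")
a.L, a.R = dummy, b
b.L, b.R = a, dummy
hats = Hats()
hats.LeftHat, hats.RightHat = a, b

# popRight reads the hats (lines 3-4) and rh->L (line 10).
rh, lh_seen = hats.RightHat, hats.LeftHat
rhL = rh.L
# popLeft reads the hats (lines 3-4) and lh->R (line 10).
lh, rh_seen = hats.LeftHat, hats.RightHat
lhR = lh.R

# Each thread saw two distinct hats, so each takes the multinode branch.
ok_right = dcas(hats, "RightHat", rh, "L", rh, rhL, rhL, rh)  # operation D
ok_left = dcas(hats, "LeftHat", lh, "R", lh, lhR, lhR, lh)    # operation D'

# Operations E and E' then clear the outward pointers of the popped nodes.
rh.R = dummy
lh.L = dummy

empty_from_right = hats.RightHat.R is hats.RightHat
empty_from_left = hats.LeftHat.L is hats.LeftHat
```

Both DCAS operations succeed because they touch disjoint pairs of locations, and the resulting state satisfies the emptiness tests used by all four operations.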



4 Sketch of Correctness Proof for the “Snark” Algorithm

We reason on a state transition diagram in which each node represents a class of 
possible states for the deque data structure and each transition arc corresponds 
to an operation in the code that can modify the data structure. For every node 
and every distinct operation in the code, there must be an arc from that node 
for that operation unless it can be proved that, when the deque is in the state 
represented by that node, either the operation must fail or the operation cannot 
be executed because flow of control cannot reach that operation with the deque 
in the prescribed state. 

The possible states of a Snark deque are shown in the following state transi- 
tion diagram: 
 1 val popLeft() {
 2   while (true) {
 3     lh = LeftHat;                       // Delicate order of operations
 4     rh = RightHat;                      // here (see proof of Theorem 4
 5     if (lh->L == lh) return "empty";    // and the Conclusions section)
 6     if (lh == rh) {
 7       if (DCAS(&LeftHat, &RightHat, lh, rh, Dummy, Dummy))   /* C' */
 8         return lh->V;
 9     } else {
10       lhR = lh->R;
11       if (DCAS(&LeftHat, &lh->R, lh, lhR, lhR, lh)) {        /* D' */
12         result = lh->V;
13         lh->L = Dummy;                                        /* E' */
14         lh->V = null;                   /* optional (see text) */
15         return result;
16 } } } }  // Better to stack braces than to omit a lemma

Fig. 6. Simple linked-list deque: left-hand-side pop.




The rightmost node shown actually represents an infinite set of nodes, one 
for each integer n for n > 1, where there are n + 2 items in the deque. The labels 
on the transition arcs correspond to the labels on operations that modify the 
linked-list data structure in Figures 3, 4, 5, and 6. The labels B+ and B'+ indicate
a transition that increases n by 1; the labels D- and D'- indicate a transition
that decreases n by 1. We will also use labels such as A and A' in the text that
follows to refer to DCAS and assignment operations in those figures.

We say that a node is “in the deque from the left” if it is not left-dead and 
it is reachable from the node referred to by the LeftHat by zero or more steps 
of following pointers in the L field. We say that a node is “in the deque from the 
right” if it is not right-dead and it is reachable from the node referred to by the 
RightHat by zero or more steps of following pointers in the R field. 

The Snark algorithm is proved correct largely by demonstrating that, for 
every DCAS operation and every possible state of the deque data structure, if 
the DCAS operation succeeds then a correct transition occurs as shown in the 
state diagram. In cases where there is no corresponding arc on the state diagram, 
it is necessary to prove either that the DCAS cannot succeed if the deque is in 
that state or that control cannot reach the DCAS with the deque in that state. 
Here we provide proofs only for these latter cases. 
Lemma 1. A node is in the deque from the left if and only if it is in the deque
from the right (therefore from now on we may say simply “in the deque”).

Lemma 2. If a node is in the deque and then is removed, thereafter that node 
is never in the deque again. 

Lemma 3. No node except the Dummy node is ever both left-dead and right-dead.

Proof. Initially, only the Dummy node exists. Inspection of the code for pushRight
and pushLeft shows that newly created nodes are never made left-dead or
right-dead. Only operation D' ever makes an existing node right-dead, and only
operation D ever makes an existing node left-dead. But D and D' each operate on
a node that is in the deque, and as it makes a node left-dead or right-dead, it
removes it from the deque. By Lemma 2, a node cannot be removed twice. So
the same node is never made left-dead by D and also made right-dead by D'. □

Lemma 4. No node is ever made left-dead or right-dead after the node is re- 
moved from the deque. 

Proof. By Lemma 2, after a node is removed from the deque it is never in the
deque again. Only operation D' ever makes an existing node right-dead, and only
operation D ever makes a node left-dead. But each of these operations succeeds
only on a node that is in the deque. □

Lemma 5. Once a node is right-dead, it stays right-dead as long as it is reachable
from any thread.

Proof. Only operations B, D', and E change the R field of a node. But B succeeds
only if the node referred to by rh is not right-dead, and D' always makes the
node referred to by lh right-dead. Operation E always stores into the R field
of a node that has been made left-dead as it was removed from the deque. By 
Lemma 3, the node was not right-dead when it was removed from the deque; 
by Lemma 4, the node cannot become right-dead after it was removed from the 
deque. Therefore when operation E changes the R field of a node, that node is 
not right-dead. □

Lemma 6. Once a node is left-dead, it stays left-dead as long as it is reachable
from any thread. 

Lemma 7. The RightHat points to a right-dead node if and only if the deque
is empty, and the LeftHat points to a left-dead node if and only if the deque is
empty.

Proof. Initially both RightHat and LeftHat point to the Dummy node, so this
invariant is initially true. Operations A and A' make both RightHat and LeftHat
point to a new node that is not left-dead or right-dead, so the deque is not empty.
Operation B can succeed only if the RightHat points to a node that is not
right-dead, and it changes RightHat to point to a new node that is not right-dead. A
symmetric remark applies to B'. Operations C and C' make both RightHat and
LeftHat point to the Dummy node, which is both left-dead and right-dead, so the
deque is empty. If operation D moves RightHat from a node that is not
right-dead to a node that is right-dead, then the deque had only one item in it; then
the LeftHat also points to the node just removed from the deque by operation D,
and that operation, as it moved the RightHat and emptied the deque, also made
the node left-dead. A symmetric remark applies to D'. Operations E and E' do
not change whether a node is left-dead or right-dead (see proof of Lemma 5). □

Theorem 1. Operation A fails if the deque is not empty. 

Proof. Operation A is executed only after the node referred to by rh has been 
found to be right-dead. By Lemma 5, once a node is right-dead, it remains right- 
dead. Therefore, if the deque is non-empty when A is executed, then RightHat 
must point to some other node than the one referred to by rh; therefore RightHat 
does not match rh and the DCAS must fail. □

Theorem 2. Operation B fails if the deque is empty. 

Proof. Operation B is executed only after rhR has been found unequal to rh. 
If the deque is empty when B is executed, and RightHat equals rh, then the 
node referred to by rh must have become right-dead; but that means that rh->R 
equals rh, and therefore cannot match rhR, and so the DCAS must fail. □

Theorem 3. Operation C fails unless there is exactly one item in the deque.

Proof. When C is executed, rh equals lh, so C can succeed only when RightHat
and LeftHat point to the same node. If the deque has two or more items in it,
then RightHat and LeftHat contain different values, so the DCAS must fail.

If the deque is empty, and RightHat and LeftHat point to the same node, then
by Lemma 7 that node must be both left-dead and right-dead, and by Lemma 3
that node must be the Dummy node, which is created right-dead and (by Lemma 5)
always remains right-dead. But then the test in line 5 of popRight would have
prevented control from reaching operation C. Therefore, if C is executed with the
deque empty, RightHat and LeftHat necessarily contain different values, so the
DCAS must fail. □

Theorem 4. Operation D fails if the deque is empty. 

Proof. This is the most difficult and delicate of our proofs. Suppose that some 
thread of control T is about to execute operation D. Then T, at line 3 of 
popRight, read a value from RightHat (now in T’s local variable rh) that pointed 
to a node that was not right-dead when T executed line 5; therefore the deque 
was not empty at that time. Also, T must have read a value from LeftHat in 
line 4 that turned out not to be equal to rh when T executed line 6. 

Now suppose, as T executes D in line 11, that the deque is empty. How
might the deque have become empty since T executed line 5? Only through the
execution of C or C' or D or D' by some other thread U. If U executed C or D,
then it changed the value of RightHat; in this case T’s execution of DCAS D
must fail, because RightHat will not match rh.
So consider the case that U executed C' or D'. (Note that, for the execution
of D by T to succeed, there cannot have been another thread that performed
a C' or D' after U but before T’s execution of DCAS D, because that would
require a preceding execution of A or of A', either of which would have changed
RightHat, causing T’s execution of DCAS D to fail.)

Now, if U executed C', then U changed the value of RightHat (to point to
Dummy); therefore T’s execution of DCAS D must fail.

If, on the other hand, U executed D' to make the deque empty, then the
deque must have had one item in it when U executed DCAS D'. But thread U
read values for LeftHat (in line 3 of popLeft) and RightHat (in line 4) that
were found in line 6 not to be equal. Therefore, when U read RightHat in line 4,
either the deque did not have exactly one item in it or the value of LeftHat had
been changed since U read LeftHat in line 3. If LeftHat had been changed, then
execution of D' by U would have to fail, contrary to our assumption. Therefore,
if there is any hope left for execution of D' by U to succeed, the deque must not
have had exactly one item in it when U read RightHat in line 4.

How, then, might the deque have come to hold exactly one item after U
executed line 4? Only through some operation by a third thread. If that operation
was A' or B' or C' or D', that operation must have changed LeftHat; but that
would cause the execution of DCAS D' by U to fail, contrary to our assumption.
Therefore the operation by a third thread must have been A or B or C or D.
Consider, then, the most recent execution (relative to the execution of D by T)
of DCAS A or B or C or D that caused the deque to contain exactly one item, and
let V be the thread that executed it. (It is well-defined which of these DCAS
executions is most recent because DCAS operations A, B, C, and D all synchronize
on a common variable, namely RightHat.)

If this DCAS operation by thread V occurred after thread T read RightHat 
in line 3, then it changed RightHat after T read RightHat, and the execution 
of DCAS D by T must fail. Therefore, if there is any hope left for execution of D 
by T to succeed, then execution of the most recent DCAS A or B or C or D (by 
V) must have occurred before T read RightHat in line 3. 

To summarize the necessary order of events: (a) U reads LeftHat in line 3
of popLeft; (b) V executes A or B or C or D, resulting in the deque containing
one item; (c) T executes lines 3, 4, 5, and 6 of popRight; (d) U executes D';
(e) T executes D. Moreover, there was no execution of A or B or C or D by any
other thread after event (b) but before event (e), and there cannot have been
any execution of A' or B' or C' or D' after event (a) but before event (d).

Therefore the deque contained exactly one item during the entire time that 
T executed lines 3 through 6 of popRight. But if so, the test in line 6 would have
prevented control from reaching D. 

Whew! We have exhausted all possible cases; therefore, if DCAS D is executed 
when the deque is empty, it must fail. □

Theorem 5. Operation E always succeeds and does not change the number of 
items in the deque. 

Symmetric theorems apply to operations A', B', C', and D'.
Space limitations prevent us from presenting a proof of linearizability and a
proof that the algorithms are non-blocking: that is, if any subset of the processors
invoke push or pop operations but fail to complete them (whether the thread
be suspended, or simply unlucky enough never to execute a DCAS successfully),
the other processors are in no way impeded in their use of the deque and can 
continue to make progress. However, we observe informally that a thread has 
not made any change to the deque data structure (and therefore has not made 
any progress visible to other threads) until it performs a successful DCAS, and 
once a thread has performed a single successful DCAS then, as observed by other 
threads, a push or pop operation on the deque has been completed. Moreover, 
each DCAS used to implement a push or pop operation has no reason to fail 
unless some other push or pop operation has succeeded since it was invoked. 

5 Conclusions 

We have presented non-blocking algorithms for concurrent access to a double- 
ended queue that supports the four operations pushRight, popRight, pushLeft, 
and popLeft. They depend on a multithreaded execution environment that sup- 
ports automatic storage reclamation in such a way that a node is reclaimed only 
when no thread can possibly access it. Our technique improves on previous meth- 
ods in requiring only one DCAS per push or pop (in the absence of interference) 
while allowing the use of dynamically allocated storage to hold queue items. 

We have two remaining concerns about this algorithm and the style of pro- 
gramming that it represents. First, the implementation of the pop operations is 
not entirely satisfactory because a popRight operation, for example, necessarily 
reads LeftHat as well as RightHat, causing potential interference with pushLeft
and popLeft operations even when there are many items in the queue, which in 
hardware implementations of interest could degrade performance. 

Second, the proof of correctness is complex and delicate. While DCAS op- 
erations are certainly more expressive than CAS operations, and can serve as a 
useful building block for concurrent algorithms such as the one presented here 
that can be encapsulated as a library, after our experience we are not sure that 
we can wholeheartedly recommend DCAS as the synchronization primitive of 
choice for everyday concurrent applications programming. In an early draft of 
this paper, we had transposed lines 4 and 5 of Figure 4 (and similarly lines 4 
and 5 of Figure 6); we thought there was no need for popRight to look at the 
LeftHat until the case of an empty deque had been disposed of. We were wrong.
As we discovered when the proof of Theorem 4 would not go through, that ver- 
sion of the code was faulty, and it was not too difficult to construct a scenario 
in which the same node (and therefore the same value) could be popped twice 
from the queue. As so many (including ourselves) have discovered in the past, 
when it comes to concurrent programming, intuition can be extremely unreli- 
able and is no substitute for careful proof. While we believe that non-blocking 
algorithms are an important strategy for building robust concurrent systems, we 
also believe it is desirable to build them upon concurrency primitives that keep 
the necessary proofs of correctness as simple as possible. 
Abstract. This paper deals with the implementation of an English auction
on a distributed system. We assume that all messages are restricted
to bids and resignations (referred to as the limited communication assumption)
and that all participants are trying to maximize their gains
(referred to as the prudence assumption). We also assume that bidders
are risk-neutral, and that the underlying communication network is complete,
asynchronous and failure-free. Under these assumptions, we show
that the time and communication requirements of any auction process
are Ω(M_2) and Ω(M_2 + n) respectively, where M_2 denotes the second
largest valuation of a participant in the auction.

We then develop a number of distributed algorithmic implementations 
for English auction, analyze their time and communication requirements, 
and propose an algorithm achieving optimal time and communication, 
i.e., meeting the above lower bounds. Finally, we discuss extensions to
the case of dynamically joining participants. 



1 Introduction 

1.1 Background 

The theory of auctions is a well-researched area in the field of economics (cf.
[5, 6, 8, 9] and the references therein). While auctions come in many different
forms (such as Dutch auction, first price sealed bids, Vickrey auction, double
auction and many other variations), in this paper we focus on one of the
most commonly used methods, namely, the English auction.

An English auction proceeds as follows. The auctioneer (be it the seller or a 
third party, e.g., an auction house) displays an article to be sold and announces 
the reserved price, namely, the minimal price the auction begins with (hereafter
assumed w.l.o.g. to be 0). The auction process consists of bidders making in- 
creasingly higher bids, until no one is willing to pay more. The highest bidder 
buys the article at the proposed price. There may be a minimal increase for each 
bid; for simplicity we assume (w.l.o.g. again) that this minimum increase is 1. 

^ Supported in part by a grant from the Israel Ministry of Science and Art. 

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 74-88, 2000. 

The recent advent of Internet-based electronic commerce has led to fast development
in the use of online auctions (cf. [1,4,3] and the references therein). The
main advantage of online auctioning is that the bidders do not have to attend 
personally; an agent (including an electronic agent) would suffice. This means 
that the sale is not confined to any physical place, and bidders from any part 
of the world are allowed to join the auction. The most popular type of auction 
chosen by the different Internet auction houses is by far the English auction. 

For simplicity, we ignore a number of complicating factors which arise in ac- 
tual auction systems. For instance, the system is assumed to take care of security 
issues. (For a treatment of these issues see [7].) Also, we do not address the issue 
of fault-tolerance, i.e., it is assumed that the auction system is reliable and fault- 
free. Finally, an aspect unique to auctions which will not be dealt with directly 
in this paper concerns potential attempts by the auctioneer or the participants 
to influence the outcome of the auction (in lawful and/or unlawful ways). For
example, the auctioneer may try to raise the offer to the maximum using a shill 
(namely, a covert collaborator) in the auction. In the case of computerized En- 
glish auctioning, that shill does not really have to exist; it can be an imaginary 
bidder imitated by the auctioneer. But, as in live auctioning, the auctioneer may 
end up selling the article to itself. 



1.2 The model 

Let us next describe a model for auctions in a distributed network-based sys- 
tem. The basic components of this model include the underlying communication 
network, the auction system, and the auction protocol. 

The communication network: The underlying communication network is
represented by a complete n-vertex graph, where the vertices $V = \{v_1, \ldots, v_n\}$
represent the network processors and every two vertices are connected by an
undirected edge, representing a bidirectional communication channel.

Communication is asynchronous, and it is assumed that at most one message 
can occupy a communication channel at any given time. Hence, the channel 
becomes available for the next transmission only after the receiving processor 
reads the previous message. 

The auction system: The auction system is formally represented as a pair
$\mathcal{A} = (\beta, n)$, where $n$ is the number of nodes of $V$ hosting the auction
participants, and $\beta$ is the function assigning valuations to the bidders. Without
loss of generality we assume that the auctioneer resides in the node $A = v_1$. For
simplicity, it is assumed that each node hosts a single bidder, thus in an n-vertex
network there may be at most n bidders. In real life situations, a single processor 
(or network node) may host any number of prospective bidders. Our algorithmic 
approach can be easily altered to accommodate such situations. 

The valuation $\beta(v_i)$ assigned to each participant $v_i$ is a natural number
representing the maximal offer the participant is willing to bid. These valuations may
depend on a number of parameters, and in principle may change as the bidding
progresses. This might be the case, for instance, if some participants are
risk-averse. In the current paper we ignore the issue, simply assuming risk-neutral
participants whose valuation $\beta(v_i)$ remains constant during the entire bidding
process.

The auction process: The behavior of any auction is determined by the
auction system $\mathcal{A}$, which can be thought of as the input, and the set of protocols
ALG used by the participants. An auction process $\eta_{\mathrm{ALG}}(\mathcal{A})$ is the execution of a
given algorithm ALG over a given auction system $\mathcal{A}$. The algorithm ALG consists
of a set of bidder protocols $\mathrm{ALG}_i$ invoked by the participating nodes $v_i$ and a
special protocol $\mathrm{ALG}_A$ used by the node $A$ hosting the auctioneer. Note that the
bidder protocols are not required to be identical. This imitates real auctioning,
since in reality, each participant has its own bidding policy. The behavior of a
protocol $\mathrm{ALG}_i$, executed by $v_i$, relies solely on $\beta(v_i)$ and the inputs received by
$v_i$ during the auction process.

English auctions: As explained earlier, in an English auction the auctioneer
declares an initial price and the bidders start bidding up until no one offers a 
new bid or the auctioneer decides to stop the auction at a given price. To make 
our model concrete, the following assumptions are postulated on the network 
and the participants. 

The auctioneer will sell the article to the highest bidder. Let $M$ denote the
set of all the integers that occur as valuations in a given auction system $\mathcal{A}$.
The members of $M$ are denoted in decreasing order by $M_1, M_2, \ldots$ and so on.
Hence, assuming the auction is carried to the end, the participant with valuation
$M_1$ will win the auction. In case there is more than one participant with the
highest valuation, the first one to bid the maximal offer is the winner. In case of
a simultaneous bid, some sort of a tiebreaker will determine the winner.

It is assumed that during the bidding, each offer is final and obligating. Also, 
once a participant has resigned (namely, failed to offer a bid upon request), 
it cannot rejoin the bidding process. Hence the auction starts with a group of
possible bidders $P$, and as the auction process progresses, the set $P$ is partitioned
into two disjoint sets, namely the set of active participants at the beginning of
round $t$, denoted $AP_t$, and the set of resigned participants, denoted $RP_t$. At any
time $t$, $AP_t \cap RP_t = \emptyset$ and $AP_t \cup RP_t = P$. Of course, both sets change as the
auction progresses and participants move from $AP_t$ to $RP_t$.

The current (highest) offer at the end of round $t$ is denoted by $B_t$ (later on,
$t$ is sometimes omitted). Initially $B_0 = 0$. On each round $t$ of the execution, the
auctioneer addresses a set of participants, henceforth referred to as the query set
of round $t$. It presents the members of this set with the current bid $B_{t-1}$, and
requests a new, higher bid. Upon receiving the bidding request, each addressed
participant $v_i$ decides, according to its protocol $\mathrm{ALG}_i$, whether to commit to a new
offer $B(v_i, t)$ or resign. The auctioneer waits until receiving a reply (in the form
of a bid $B(v, t)$ or a resignation) from each addressed participant. According to
these answers, the auctioneer updates $B_t$ and the sets $AP_t$ and $RP_t$, and decides
on its next step. In particular, in case there were one or more bidders, the
auctioneer appoints one of them as the current winner, denoted $W_t$. In contrast,
if all approached participants have resigned, then the auctioneer will approach
a new query set in the next round. The process continues until all participants
but the current winner resign, upon which the auction terminates.

At any given moment during the bidding process, the current configuration of
the auction system is described as a tuple $C_t = (B_t, W_t, AP_t)$, where $B_t$ denotes
the current bid, $W_t$ is the current winner (namely, the participant committed
to $B_t$) and $AP_t$ is the set of currently active participants. The initial auction
configuration is $C_0 = (0, \mathit{null}, V)$.

Communication and time complexities: It is assumed that sending a
message from $v_i$ to a neighbor $v_j$ takes one time unit. The minimal data unit
that needs to be transferred in an auction is the ID of a participant, which
requires $O(\log n)$ bits, and the offer itself, $b = B(v, t)$, which takes $\log b$ bits.
Henceforth, we assume that the allowable message size is some value $m$ large
enough to hold the offer, i.e., $m = \Omega(\log n + \log b)$. (Our results can be readily
extended to models allowing only fixed-size messages, in the natural way.)

The time complexity of a given algorithm ALG on a given auction system
$\mathcal{A}$, denoted $T_{\mathrm{ALG}}(\mathcal{A})$, is the number of time units incurred by an execution
$\eta_{\mathrm{ALG}}(\mathcal{A})$ of ALG on $\mathcal{A}$ from beginning to completion in the worst case. The
communication complexity of ALG on $\mathcal{A}$, denoted $C_{\mathrm{ALG}}(\mathcal{A})$, is the number of
messages of size $m$ incurred by the execution $\eta_{\mathrm{ALG}}(\mathcal{A})$ in the worst case.



1.3 Our results 

In this paper we initiate the study of distributed implementations for English 
auctions, and their time and communication complexities. Section 2 discusses 
our assumptions and their basic implications. We assume that all messages are 
restricted to bids and resignations. This is referred to as the limited communication
assumption. We also assume that all participants are trying to maximize
their gains. This is referred to as the prudence assumption. Under these assumptions,
it is shown that the time and communication requirements of any auction
protocol are $\Omega(M_2)$ and $\Omega(M_2 + n)$ respectively. Our main result is presented in
Section 3, in which we develop a number of distributed algorithmic implementa- 
tions for English auction, analyze their time and communication requirements, 
and propose an algorithm achieving optimal time and communication, i.e., meet- 
ing the above lower bounds. Finally we discuss an extension of our algorithm to 
the case of dynamically joining participants. 

2 Basic properties 

2.1 Assumptions on computerized English auctions 

The assumptions on English auctions stated in Section 1.2 follow the behavior 
of live English auctions, and serve as basic guidelines in any implementation of 
English auction. In contrast, the set of assumptions stated next is optional; these
assumptions are not always essential but they are often natural, and are taken
in order to facilitate handling an auction in a distributed setting.

The first assumption is that the auction is limited, meaning that the data
sent over the network is limited to participant ID’s accompanied by bids or 
resignations. This restriction adheres to the behavior of live English auctions, 
where bids are the only type of communication allowed between the auctioneer 
and attendees. (Clearly, in reality it is not feasible to enforce this requirement 
since an auction house or participant on the Internet cannot monitor the behavior 
and private communications between any members participating in the auction. 
As mentioned earlier, in the current paper we do not address enforcement issues.) 

It is assumed that none of the participants will exceed its valuation. Conversely,
it is assumed that a participant will not resign until the current highest
bid exceeds its valuation $\beta(v_i)$. In other words, the bidders are risk-neutral.
This leads to a natural bidder protocol, by which the participant $v_i$ responds
to each request with a higher bid whenever the current highest offer $B$ satisfies
$B < \beta(v_i)$, and a resignation otherwise.
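
The natural bidder protocol just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `natural_bidder_reply`, the use of `None` to encode a resignation, and the choice to raise by exactly one unit (the minimal increase assumed in Section 1.1) are assumptions made for the example, not part of the model.

```python
from typing import Optional


def natural_bidder_reply(valuation: int, current_bid: int) -> Optional[int]:
    """Reply of a risk-neutral bidder to a bidding request.

    Returns the new offer, or None to signal a resignation (an encoding
    chosen here for illustration only).
    """
    if current_bid < valuation:
        # Still below the valuation: raise by the minimal increment of 1.
        return current_bid + 1
    # The current offer has reached the valuation: resign.
    return None
```

For a bidder with valuation 5, a request at bid 3 is answered with an offer of 4, while a request at bid 5 is answered with a resignation.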

Note that this is not the only risk-neutral protocol. In case the current offer
$B$ satisfies $B < \beta(v_i)$, the bidder $v_i$ may raise the bid to any value between $B$
and $\beta(v_i)$ and still maintain risk-neutrality. We now introduce another reasonable
assumption (called prudence), and show that under this assumption, a unit
increment becomes mandatory. The prudence assumption concerns the rational
behavior of the auctioneer and the bidders. We say that the auction is prudent if
all participants try to maximize their success. For the bidders this means paying
the minimal price possible. For the auctioneer it means receiving the highest
offer possible.

Definition 1. A protocol $\mathrm{ALG}_i$ is bidder-prudent if it ensures that the bidder
that wins the auction pays at most $M_2 + 1$ (where $M_2$ is the second highest
valuation among all participants). When more than one participant has valuation
$M_1$, the price is at most $M_1$.

A protocol $\mathrm{ALG}_A$ is auctioneer-prudent if it always results with the highest
possible offer.

An auction protocol is prudent if it is both auctioneer-prudent and bidder-prudent.

Consequently, when an auction is prudent, the final offer would be exactly
$M_2 + 1$ (it may be $M_1$ in case of a tie, or $M_2$ if the second highest valuation was
offered by the participant with the highest valuation).

As mentioned earlier, we do not deal with the question of how prudence can 
be enforced against possible attempts of cheating, by the auctioneer or some of 
the participants. 

Hereafter, we make the following assumptions, for simplicity of presentation.
First, there is only one participant with the maximal valuation $M_1$. Secondly,
$M_2$ is always offered by someone other than the participant with the maximal
valuation. Notice that these assumptions are not necessary, and all our results
may be easily modified to handle the omitted extreme cases.

2.2 Properties of the English auction

We next analyze the special properties implied by the above assumptions. 

Lemma 1. Any bidder-prudent protocol $\mathrm{ALG}_v$ increases the bid by at most 1 in
each step.

Proof. Suppose, for the sake of contradiction, that there exists a bidder-prudent
protocol $\mathrm{ALG}_v$ which allows $v$ to raise some bid by more than 1. Then, at some
point $t$ in some execution $\eta_{\mathrm{ALG}}(\mathcal{A})$, the last bid offered by some participant $w$
was $B_t = B(w, t)$, and at time $t + 1$, $v$ offers $B(v, t+1) > B(w, t) + 1$.

Let $Z$ be the set of participants other than $v$ with valuation at least $B(w, t)$,

$Z = \{u \ne v \mid \beta(u) \ge B(w, t)\}$.

Consider now a different scenario, over an auction system $\mathcal{A}' = (\beta', n)$, where
$\beta'(u) = B(w, t)$ for every $u \in Z$ and $\beta'(u) = \beta(u)$ for every $u \notin Z$. Let us also
specify all protocols $\mathrm{ALG}'_u$ for all $u \in Z$ to act the same as in $\eta_{\mathrm{ALG}}(\mathcal{A})$ until
round $t - 1$. At time $t$ all new protocols are forced to resign. The executions
remain the same until round $t + 1$, when $v$ raises the bid to $B(v, t+1)$. After this
point, no participant raises the bid again, and $v$ ends up winning the auction
with $B(v, t+1)$ instead of the current $B_t + 1 = B(w, t) + 1$. This contradicts the
prudentiality of $\mathrm{ALG}_v$, since $v$ could have won the auction by offering exactly
$B(w, t) + 1$. ∎

Corollary 1. In a limited and prudent auction protocol ALG on an auction
system $\mathcal{A}$, the auctioneer must receive a bid $B(v, t) = x$ by some $v$, at some time $t$
during the execution, for every $1 \le x \le M_2 + 1$. ∎

Lemma 2. In a limited and prudent auction protocol ALG on an auction system
$\mathcal{A}$, there cannot be a round $t$ in which the auctioneer receives simultaneously two
different new bids higher than $B_{t-1}$.

Proof. Suppose, for the sake of contradiction, that there exists an execution
$\eta_{\mathrm{ALG}}(\mathcal{A})$ in which on round $t$ in the auction the highest published offer is $B_{t-1}$
and the auctioneer received $B(w, t)$ and $B(v, t)$, where $B_{t-1} < B(w, t), B(v, t)$
and also $B(w, t) \ne B(v, t)$. Without loss of generality, assume that $B(w, t) >
B(v, t)$. But this implies that $w$'s protocol is not prudent by Lemma 1; contradiction. ∎

Corollary 2. For any prudent and limited protocol ALG for a given auction
system $\mathcal{A}$, $T_{\mathrm{ALG}}(\mathcal{A}) = \Omega(M_2)$ and $C_{\mathrm{ALG}}(\mathcal{A}) = \Omega(M_2 + n)$.

Proof. Since there must be a separate bid, at a separate time, for each possible
value between 1 and $M_2$ (Lemma 1, Corollary 1 and Lemma 2), $T_{\mathrm{ALG}}(\mathcal{A}) =
\Omega(M_2)$ and $C_{\mathrm{ALG}}(\mathcal{A}) = \Omega(M_2)$. The bidding may stop only after all but one
of the participants have resigned, hence $C_{\mathrm{ALG}}(\mathcal{A}) = \Omega(n)$. Combined together,
$C_{\mathrm{ALG}}(\mathcal{A}) = \Omega(M_2 + n)$. ∎

Remark: The lower bound on communication complexity does not hold if the
underlying communication network is synchronous, as employing the well-known
idea of using silence to convey information, it is possible to devise an algorithm
based on interpreting clock-ticks as active bids, with time complexity $O(M_2)$
and communication complexity $O(n)$ (see [2]).

Next, it is proven that under a prudent auction-protocol, the same participant 
cannot bid twice in a row unless someone else bid in between. 

Lemma 3. Under a limited and prudent auction protocol ALG on a given system
$\mathcal{A}$, if the auctioneer receives two different consecutive bids $B(w, t) = B$ and
$B(w, t+1) = B + 1$ from the same participant $w$, then it must have received a
bid of $B$ from another participant on round $t$.

Proof. Suppose that there is an execution in which participant $w$ offers the bids
$B(w, t) = B$ followed by $B(w, t+1) = B + 1$, with no one else bidding in round
$t$. Let $Z$ be the set of participants whose valuation exceeds or equals $B(w, t)$,
i.e.,

$Z = \{v \ne w \mid \beta(v) \ge B(w, t)\}$.

Consider a different auction system $\mathcal{A}' = (\beta', n)$, where $\beta'(u) = B(w, t) - 1$ for
every $u \in Z$ and $\beta'(u) = \beta(u)$ for every $u \notin Z$. Note that in $\mathcal{A}'$ the new valuations
satisfy $M_1 \ge B(w, t) + 1$ and $M_2 = B(w, t) - 1$. However, the execution on $\mathcal{A}'$
and the bidding stay the same as in $\eta_{\mathrm{ALG}}(\mathcal{A})$ up until time $t$, when $w$ raised
the bid to $B(w, t)$ followed by $B(w, t) + 1$. Following this point, no participant
will raise the bid, and $w$ will end up winning this auction with $B(w, t) + 1$. That
contradicts the bidder-prudentiality of the auction protocol, since $w$ could have
won it with the first offer of $B(w, t)$ made at time $t$. ∎



Remark: The algorithmic implementations of English auction discussed in the
next section enforce the requirement implied by the last lemma by ensuring that
the designated winner of round $t$, $W_t$, is not approached again by the auctioneer
before getting a higher bid from someone else.

Finally, we point out that the global lower bounds of Cor. 2 are easily achieved 
by an offline algorithm Opt. 

Lemma 4. For any auction system $\mathcal{A} = (\beta, n)$, an offline algorithm Opt can
perform an auction optimally in both communication and time, i.e., $T_{\mathrm{Opt}}(\mathcal{A}) =
O(M_2)$ and $C_{\mathrm{Opt}}(\mathcal{A}) = O(n + M_2)$.

Proof. Assume that $v_1$ and $v_2$ are the participants with valuations $\beta(v_1) = M_1$
and $\beta(v_2) = M_2$. Then the optimal algorithm Opt runs an auction between these
two participants until $v_2$ resigns when the bid reaches $B = M_2 + 1$. This takes
$O(M_2)$ time and $O(M_2)$ messages. Now Opt addresses all other participants in
a round of broadcast and convergecast, receiving simultaneously the remaining
$n - 2$ resignation messages. This takes two additional time units and $2(n - 2)$
messages. Overall, $T_{\mathrm{Opt}}(\mathcal{A}) = O(M_2)$ and $C_{\mathrm{Opt}}(\mathcal{A}) = O(n + M_2)$. ∎

3 Auctioning algorithms

3.1 Set algorithms 

All our algorithms belong to a class termed set algorithms, which can be cast
in a uniform framework as follows. On each round $t$ of the execution, the query
set selected by the auctioneer is an arbitrary subset of active participants,
$\sigma_t \subseteq AP_t \setminus \{W_t\}$, of size $p_t$. The only difference between the various set
algorithms is in the size $p_t$ fixed for the query set $\sigma_t$ on each round $t$.

We develop our optimal algorithm through a sequence of improvements. Our 
first two simple algorithms represent two extremes with opposing properties; the 
first is communication optimal, the other - time optimal. 



The Singleton algorithm (Singl): Algorithm singleton (Singl) is simply
the sequential set algorithm with $p_t = 1$. Specifically, at every time $t \ge 1$ the
adversary chooses a query set $\sigma_t = \{v\}$ for some $v \in AP_t \setminus \{W_t\}$. The participant
$v$ chosen for being queried on each round can be selected in round-robin fashion,
although this is immaterial, and an arbitrary choice will do just as well.

For estimating the communication and time complexities of Algorithm Singl,
note that the auctioneer needs to receive resignations from all participants but
one, and also raise the bid up to $M_2 + 1$, getting $C_{\mathrm{Singl}}(\mathcal{A}) = T_{\mathrm{Singl}}(\mathcal{A}) =
O(M_2 + n)$.
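
As an illustration, Algorithm Singl can be simulated in a few lines of Python. The sketch below is not taken from the paper: the function name `singleton_auction`, the round-robin queue, the unit-increment bidder rule, and the convention of counting one request plus one reply per queried bidder are all assumptions made for the example.

```python
from collections import deque


def singleton_auction(valuations):
    """Simulate a one-bidder-per-round auction with unit-increment bidders.

    valuations[i] is the valuation of bidder i (a positive integer).
    Returns (winner, final_bid, rounds, messages).
    """
    queue = deque(range(len(valuations)))  # active non-winners, round-robin
    winner, bid, rounds, messages = None, 0, 0, 0
    while queue:
        v = queue.popleft()
        rounds += 1
        messages += 2                      # one request and one reply
        if bid < valuations[v]:
            bid += 1                       # the bidder raises by exactly 1
            if winner is not None:
                queue.append(winner)       # the old winner may be queried again
            winner = v
        # otherwise v resigns and is never queried again
    return winner, bid, rounds, messages
```

On the valuations `[3, 5, 2]` the sketch ends with bidder 1 winning at a bid of 4, that is, $M_2 + 1$. Every round is either a raise or a resignation, so the round count equals the final bid plus the $n - 1$ resignations, in line with the $O(M_2 + n)$ estimate.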



Algorithm Full: At the other extreme, Algorithm Full requires the auctioneer
to address, on each round $t$, all active participants (except for the node $W_t$
designated as the winner of the previous bidding round). I.e., it selects $p_t =
|AP_t| - 1$ and hence $\sigma_t = AP_t \setminus \{W_t\}$.

Lemma 5. For any auction system $\mathcal{A} = (\beta, n)$, $T_{\mathrm{Full}}(\mathcal{A}) = O(M_2)$ and
$C_{\mathrm{Full}}(\mathcal{A}) = O(nM_2)$.

Proof. For analyzing the time complexity of the algorithm, assume that $v_1$ and
$v_2$ hold valuations $\beta(v_1) = M_1$ and $\beta(v_2) = M_2$ respectively. Since Algorithm
Full addresses either $v_1$ or $v_2$ (and sometimes both) on each round, the auction
will end after exactly $M_2 + 1$ rounds. Thus $T_{\mathrm{Full}}(\mathcal{A}) = O(M_2)$.

As for the communication complexity, on each round $t$ the auctioneer communicates
with all participants in $AP_t$. Hence over all $M_2 + 1$ rounds, the algorithm
incurs $C_{\mathrm{Full}}(\mathcal{A}) = \sum_{t=1}^{M_2+1} |AP_t|$ messages. As $|AP_t| \le n - 1$ for every $t$, we
have $C_{\mathrm{Full}}(\mathcal{A}) = O(nM_2)$. ∎

We note that both bounds are tight for Algorithm Full, as can be seen by
considering an auction system $\mathcal{A} = (\beta, n)$ where $\beta(v) = M_1$ for every node $v$, in
which all participants are active on every round.
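
A small Python simulation may help to see where the factor $n$ in the communication bound of Algorithm Full comes from. It is a sketch under assumptions of our own: the name `full_auction`, the unit-increment bidder rule, the smallest-index tie-breaker, and the accounting of two messages (request and reply) per queried bidder are illustrative choices, not part of the paper.

```python
def full_auction(valuations):
    """Query every active bidder except the current winner on each round.

    Returns (winner, final_bid, rounds, messages).
    """
    active = set(range(len(valuations)))
    winner, bid, rounds, messages = None, 0, 0, 0
    while True:
        query = sorted(active - {winner})    # all active bidders but the winner
        if not query:
            break                            # everyone else has resigned
        rounds += 1
        messages += 2 * len(query)           # one request and one reply each
        raisers = [v for v in query if bid < valuations[v]]
        active -= set(query) - set(raisers)  # the remaining ones resign
        if raisers:
            bid += 1                         # all raisers offer the same new bid
            winner = raisers[0]              # tie-breaker: smallest index
    return winner, bid, rounds, messages
```

With valuations `[3, 5, 2]` the simulated auction again closes at a bid of 4 for bidder 1, but it exchanges 18 messages, because several bidders are queried in parallel on the early rounds.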

The following subsections are devoted to the development of successively im- 
proved set algorithms based on some intermediate versions between Algorithms 
Singl and Full. 

3.2 The fixed size scheme (FSS)

Algorithm fixed size scheme (FSS) is based on an intermediate point between
Algorithms Singl and Full, represented by a fixed integer parameter $p \ge 1$.
In each round $t$, the auctioneer addresses an arbitrary set of $p_t = p$ active
participants. (Once the number of remaining active participants falls below $p$,
it addresses all of them.) Namely, the query set on round $t$ is some arbitrary
$\sigma_t \subseteq AP_t \setminus \{W_t\}$ of size $p_t = \min\{p, |AP_t| - 1\}$.

Lemma 6. For any fixed integer parameter $p \ge 1$ and auction system $\mathcal{A} =
(\beta, n)$, $T_{\mathrm{FSS}}(\mathcal{A}) = O(M_2 + n/p)$ and $C_{\mathrm{FSS}}(\mathcal{A}) = O(M_2 \cdot p + n)$.

Proof. The algorithm requires exactly $M_2 + 1$ bid-increase rounds to reach the
final bid. In addition, there may be at most $n/p$ rounds $t$ in which the auctioneer
receives resignations from all the participants of the query set $\sigma_t$, hence gaining
no bid increase. Overall, this yields a time complexity of $T_{\mathrm{FSS}}(\mathcal{A}) = O(M_2 +
n/p)$.

The communication complexity is bounded by noting that in each time step,
the algorithm incurs (at most) $p$ messages, hence $C_{\mathrm{FSS}}(\mathcal{A}) \le T_{\mathrm{FSS}}(\mathcal{A}) \cdot p =
O(M_2 \cdot p + n)$. ∎

Again, the analysis is tight, as evidenced by an auction system $\mathcal{A} = (\beta, n)$
where $\beta(v) = M_1$ for every node $v$.
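
The fixed size scheme can be illustrated by the following Python sketch, which queries at most `p` non-winning active bidders per round. The function name `fss_auction`, the choice of always taking the first `p` candidates in index order, the unit-increment bidder rule, and the count of two messages per queried bidder are assumptions of this example rather than details given in the paper.

```python
def fss_auction(valuations, p):
    """Simulate a set algorithm with a fixed query-set size p >= 1.

    Returns (winner, final_bid, rounds, messages).
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    active = list(range(len(valuations)))
    winner, bid, rounds, messages = None, 0, 0, 0
    while True:
        candidates = [v for v in active if v != winner]
        if not candidates:
            break                          # all but the winner have resigned
        query = candidates[:p]             # fewer than p if not enough remain
        rounds += 1
        messages += 2 * len(query)         # one request and one reply each
        raisers = [v for v in query if bid < valuations[v]]
        resigned = set(query) - set(raisers)
        active = [v for v in active if v not in resigned]
        if raisers:
            bid += 1
            winner = raisers[0]            # tie-breaker: first in the query set
    return winner, bid, rounds, messages
```

Calling `fss_auction([3, 5, 2], 1)` behaves sequentially, while `fss_auction([3, 5, 2], 2)` finishes in fewer rounds at the price of more messages, which is the trade-off expressed by Lemma 6.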



3.3 The increasing size scheme (ISS) 

Examining the communication and time performance of Algorithm FSS reveals
that using a large $p$ value is better when $M_2 < n$, and on the other hand, if
$M_2 > n$ then a small $p$ value is preferable. The break point between the two
approaches is when $M_2 = n$.
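
To make this trade-off concrete, the following worked figures plug illustrative values into the two expressions of Lemma 6, $M_2 + n/p$ for time and $M_2 \cdot p + n$ for communication. The values $n = 1024$, $p \in \{1, 64\}$ and the two choices of $M_2$ are assumptions picked for the example, and the constants hidden by the $O(\cdot)$ notation are ignored.

```latex
\begin{array}{llrr}
\text{case} & p & M_2 + n/p & M_2 \cdot p + n \\ \hline
M_2 = 16 \;(M_2 < n)    & 1  & 16 + 1024 = 1040    & 16 + 1024 = 1040 \\
M_2 = 16 \;(M_2 < n)    & 64 & 16 + 16 = 32        & 1024 + 1024 = 2048 \\
M_2 = 65536 \;(M_2 > n) & 1  & 65536 + 1024 = 66560 & 65536 + 1024 = 66560 \\
M_2 = 65536 \;(M_2 > n) & 64 & 65536 + 16 = 65552  & 4194304 + 1024 = 4195328
\end{array}
```

When $M_2$ is small, the larger set cuts the time expression by a factor of about 32 while only doubling the communication expression. When $M_2$ is large, the larger set barely changes the time expression but multiplies the communication expression by about 63.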

We next devise a set algorithm called increasing size scheme (ISS), which
exploits this behavior by using a decreasing value of the parameter $p$, inversely
proportional to $M_2$. Since the value of $M_2$ is unknown, it is estimated by the only
estimate known to us, namely, the current bid $B$. For simplicity, let us ignore
rounding issues by assuming w.l.o.g. that $n$ is a power of 2. The algorithm begins
with $p = n$, and divides $p$ by 2 whenever the current bid $B_t$ doubles. I.e., $p$ is
set to $n/2^i$ once $B_t$ reaches $2^i$. Once $B = n$, Algorithm ISS continues as the
sequential Algorithm Singl until the auction is over.
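
The halving rule of Algorithm ISS can be sketched as follows. The sketch assumes that the query-set size is $n/2^i$ while the bid lies in $[2^i, 2^{i+1})$ (and $n$ while the bid is 0), so that the scheme becomes sequential once the bid reaches $n$. This indexing, the name `iss_auction`, the unit-increment bidder rule and the two-messages-per-query accounting are our own illustrative assumptions.

```python
def iss_auction(valuations):
    """Simulate a set algorithm whose query-set size halves as the bid doubles.

    Assumes len(valuations) is a power of 2, as in the text.
    Returns (winner, final_bid, rounds, messages).
    """
    n = len(valuations)
    active = list(range(n))
    winner, bid, rounds, messages = None, 0, 0, 0
    while True:
        candidates = [v for v in active if v != winner]
        if not candidates:
            break
        # i = floor(log2(bid)) for bid >= 1, and 0 for bid == 0.
        i = max(bid.bit_length() - 1, 0)
        p = max(1, n >> i)                 # n / 2**i, never below 1
        query = candidates[:p]
        rounds += 1
        messages += 2 * len(query)         # one request and one reply each
        raisers = [v for v in query if bid < valuations[v]]
        resigned = set(query) - set(raisers)
        active = [v for v in active if v not in resigned]
        if raisers:
            bid += 1
            winner = raisers[0]            # tie-breaker: first in the query set
    return winner, bid, rounds, messages
```

On `[3, 5, 2, 1]` the low-valuation bidders are eliminated during the early, wide rounds, and the last rounds address a single bidder each.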

Lemma 7. For any auction system $\mathcal{A} = (\beta, n)$, $T_{\mathrm{ISS}}(\mathcal{A}) = O(M_2)$.

Proof. Each phase $i$ of Algorithm ISS starts with a bid value of $2^{i-1}$
and ends when either all participants have resigned or Algorithm ISS reaches
a bid of $2^i$, whichever comes first. Phase $i$ is therefore similar to a run of
Algorithm FSS with sets of size $p_i = n/2^i$, an initial bid of $2^{i-1}$ and a
maximal bidding value of $x_i = \min\{M_2, 2^i\}$, or equivalently, an initial bid of 1
and a maximal bidding value of $M^i = x_i - 2^{i-1} \le 2^{i-1}$. Hence each such phase
$i$ takes time $T_i = 2(M^i + n/p_i) \le 2(2^{i-1} + 2^i) < 2^{i+2}$.

Let us first consider the case of an auction system $\mathcal{A}$ with $M_2 \le n$. Then
Algorithm ISS reaches at most phase $I_f = \lceil \log(M_2 + 1) \rceil$, where it will reach
the final bid of $M_2 + 1$. The total time for all phases is therefore

$$T_{\mathrm{ISS}}(\mathcal{A}) \le \sum_{i=0}^{I_f} T_i \le \sum_{i=0}^{\lceil \log(M_2+1) \rceil} 2^{i+2} < 4 \cdot 2^{I_f + 1} = O(M_2).$$

Now assume that $M_2 > n$. Then the execution of Algorithm ISS has $\log n$ phases.
The first $\log n - 1$ phases, as well as the steps of the last phase up to the point
when $B_t = n$, take $O(2^{\log n}) = O(n)$ time, just as in the previous case. The remaining
steps are performed in the sequential fashion of Algorithm Singl, starting at
$B = n$ and ending at $M_2$, thus including (at most) $M_2 - n$ bidding rounds and $n$
resignation rounds. The total time complexity is again $T_{\mathrm{ISS}}(\mathcal{A}) = O(M_2)$. ∎

Lemma 8. For any auction system $\mathcal{A} = (\beta, n)$, $C_{\mathrm{ISS}}(\mathcal{A}) = O(M_2 + n \log \mu)$
where $\mu = \min\{M_2, n\}$.

Proof. As in the proof of Lemma 7, every phase $i \le \log n$ in the execution
of Algorithm ISS can be regarded as a complete execution of Algorithm FSS
starting from some appropriate initial state with maximal valuation $M^i$. Hence,
as shown in the proof of Lemma 6, the bound on the bidding communication in
a single phase $i$ of Algorithm ISS is $C_i = \Theta(p_i M^i) \le O(2^{i-1} \cdot \frac{n}{2^i}) = O(n)$. The
resignations throughout the auction require $O(n)$ additional messages.

Again the analysis is divided into two cases. When $M_2 \le n$, Algorithm ISS
stops at phase $I_f = \lceil \log(M_2 + 1) \rceil$. Thus over all $I_f$ phases, the number of
messages incurred by Algorithm ISS is

$$C_{\mathrm{ISS}}(\mathcal{A}) = \sum_{i=1}^{I_f} C_i = \sum_{i=1}^{\lceil \log(M_2+1) \rceil} O(n) = O(n \log M_2).$$

On the other hand, when $M_2 > n$, Algorithm ISS performs $\log n$ phases at a cost
of $O(n)$ messages each, as shown above, summing up to $O(n \log n)$ messages. The
remaining steps are as in Algorithm Singl, starting at an initial bid of $B_t = n$
and ending at $B_t = M_2 + 1$, at a cost of $O(M_2 - n)$ messages. In total,
$C_{\mathrm{ISS}}(\mathcal{A}) = O(M_2 + n \log n)$.

Thus the communication complexity of Algorithm ISS in the general case is
$C_{\mathrm{ISS}}(\mathcal{A}) = O(M_2 + n \log \mu)$. ∎



3.4 The varying size scheme (VSS) 

The ensuing discussion reveals the following property for set algorithms. Whenever
many participants are willing to raise the bid, it is preferable to address
a small set $\sigma_t$. On the other hand, if there are many resignations then it is
advantageous if the approached set $\sigma_t$ is large.

The varying size scheme (VSS) algorithm attempts to exploit this property
by varying the size of the addressed set dynamically. As in Algorithm ISS,
phase $k$ of the algorithm tries to double the bid $B$ from $2^{k-1}$ to $2^k$. However,
the exact size of the set $\sigma_t$ fluctuates dynamically during the phase. The initial
set size used for the first round of phase $k$ is set to $P_k = n/2^k$ (assuming, as
before, that $n$ is a power of 2). After each round
$t$, the set size used, denoted $p_t = |\sigma_t|$, is either doubled or halved, according to
the following rules.

1. If the bid was raised in round $t$, and $p_t > 1$, then $p_t$ is decreased by half, i.e.,
$p_{t+1} \leftarrow \max\{p_t/2, 1\}$.

2. If all the participants of $\sigma_t$ have resigned, and $p_t < |AP_{t+1}|$, then $p_t$ is
doubled, setting $p_{t+1} \leftarrow \min\{2p_t, |AP_{t+1}|\}$.

When the bid $B$ reaches $n$, Algorithm VSS continues as in the sequential
Algorithm Singl.
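
The two resizing rules can be tried out with the short Python simulation below. It is a sketch under stated assumptions, not the authors' implementation: the name `vss_auction` is hypothetical, phase $k$ is taken to cover bids in $[2^{k-1}, 2^k)$ with an initial set size of $n/2^k$, the set size is capped by the number of active participants, bidders raise by exactly one unit, and each queried bidder costs two messages.

```python
def vss_auction(valuations):
    """Simulate a set algorithm that halves the query set after a raise and
    doubles it after a round of resignations.

    Assumes len(valuations) is a power of 2, as in the text.
    Returns (winner, final_bid, rounds, messages).
    """
    n = len(valuations)
    active = list(range(n))
    winner, bid, rounds, messages = None, 0, 0, 0
    phase, p = 0, n                        # phase 0 starts with size n
    while True:
        candidates = [v for v in active if v != winner]
        if not candidates:
            break
        if bid.bit_length() != phase:      # the bid crossed a power of two
            phase = bid.bit_length()
            p = max(1, n >> phase)         # reset to the phase's initial size
        if bid >= n:
            p = 1                          # final sequential stage
        query = candidates[:p]
        rounds += 1
        messages += 2 * len(query)         # one request and one reply each
        raisers = [v for v in query if bid < valuations[v]]
        resigned = set(query) - set(raisers)
        active = [v for v in active if v not in resigned]
        if raisers:
            bid += 1
            winner = raisers[0]            # tie-breaker: first in the query set
            p = max(len(query) // 2, 1)    # rule 1: halve after a raise
        else:
            p = min(2 * len(query), len(active))  # rule 2: double after resignations
    return winner, bid, rounds, messages
```

On `[9, 1, 1, 1, 1, 1, 1, 7]` the set size stays large while the low bidders resign and shrinks to 1 once only the two serious bidders keep raising. The resulting message count of 42 lies below the value $2M_2 + 8n = 78$.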

We now proceed with an analysis of the time and communication complexities
of Algorithm VSS. The set of rounds in phase $k$, denoted $\mathcal{T}_k$, can be divided
into $\mathcal{T}_k = U_k \cup D_k$. The set $D_k$ consists of the down steps, which are rounds $t$ in
which the bid was increased, resulting in halving $p$ for the next round (or leaving
it at the minimal size of 1). The set $U_k$ contains the up steps, which are rounds
$t$ in which all members of $\sigma_t$ resigned, resulting in doubling $p$ for round $t + 1$ (or
leaving it at the maximal available size at that round, which is $|AP_{t+1}|$).

The set $U_k$ may be split further into two kinds of up steps.

1. $U_k^1$ is the set of steps $t$ where $p_t < P_k = \frac{n}{2^k}$.

2. $U_k^2$ is the set of steps $t$ where $p_t \ge P_k$.

Likewise, the set $D_k$ can be divided further into two subsets of down steps. The
first set, denoted by $D_k^1$, is the set of all down steps $t$ which address a set of size
$p_t = \frac{n}{2^j}$ where $j > k$ for the first time during the phase (i.e., there was no prior
up step from the same set size). Formally,

$D_k^1 = \{t \in D_k \mid p_t = \frac{n}{2^j}$ for some $j > k$,
and $p_{t'} \ne p_t$ for every $t' \in \mathcal{T}_k$ s.t. $t' < t\}$.

The remaining down steps are denoted $D_k^2 = D_k \setminus D_k^1$.

Lemma 9. Each phase $k$ takes $T_k = O(2^k)$ steps.

Proof. As $|D_k|$ equals the number of bid raises in phase $k$, which is at most $2^k$,
we have

$|D_k| \le 2^k$. (1)

^ again assuming n to be a power of 2 

Also, each step $t \in U_k^1$ increases the set size from $p_t$ to $2p_t \le P_k$. Thus, there
must have been a corresponding prior down step $t'$ that decreased a set of size
$p_{t'} = 2p_t$ to the size of $p_{t'+1} = p_t$. Hence

$|U_k^1| \le |D_k|$,

so also

$|U_k^1| \le 2^k$. (2)

Finally, $U_k^2$ consists of resignation steps that occurred on sets $\sigma_t$ of size $p_t \ge \frac{n}{2^k}$,
hence there can be at most $2^k$ steps of that sort before removing all possible
$n$ participants, implying that

$|U_k^2| \le 2^k$. (3)

Combining Inequalities (1), (3) and (2), we get that

$T_k = 2(|U_k^2| + |U_k^1| + |D_k|) \le 6 \cdot 2^k$. ∎

Let us now estimate the time complexity of Algorithm VSS.

Lemma 10. For any auction system $\mathcal{A} = (\beta, n)$, $T_{\mathrm{VSS}}(\mathcal{A}) = O(M_2)$.

Proof. First suppose $M_2 \le n$. Then Algorithm VSS is run for $\lceil \log(M_2 + 1) \rceil$
phases, and by Lemma 9 the total time is

$$T_{\mathrm{VSS}}(\mathcal{A}) = \sum_{k=0}^{\lceil \log(M_2+1) \rceil} T_k = \sum_{k=0}^{\lceil \log(M_2+1) \rceil} O(2^k) = O(M_2).$$

In case $M_2 > n$, the $\log n$ phases take $O(n)$ time by the same calculation.
Afterwards, Algorithm VSS operates as Algorithm Singl for the remaining
$M_2 - n$ bids. The time required for this stage is $M_2 - n$ steps, plus (at most) $n$
additional steps for resignations. Thus, overall, $T_{\mathrm{VSS}}(\mathcal{A}) = O(M_2)$. ∎

Next we deal with the communication complexity of Algorithm VSS. Note
that the communication cost of round $t$ is $2p_t$, and therefore a set of rounds $X$
costs $2\sum_{t \in X} p_t$. Let $C_k$ denote the total communication complexity of phase $k$,
and let $n_k^R$ denote the number of participants which resigned during phase $k$.
Let $n^R = \sum_k n_k^R$ and $C^{ph} = \sum_k C_k$. Let $C^{seq}$ denote the communication
cost of the final sequential stage of the algorithm.

Lemma 11. $C^{ph} \le 4n + 6n^R$.

Proof. Let $C_k^U$, $C_k^1$ and $C_k^2$ denote the total amount of communication due to
rounds $t$ in $U_k$, $D_k^1$ and $D_k^2$, respectively, hence $C_k = C_k^U + C_k^1 + C_k^2$. We analyze
each of the three terms separately.

Clearly, Cf < 2n§, Turning to the communication cost of rounds in 
note that any round t E in phase k involves a set at of size Pt = §■ where 
j > k. Moreover, dI contains at most one round t such that Pt — % for every 
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logn > j > k. Thus, the number of rounds t in which pf — ^ throughout the 
entire execution of Algorithm VSS is bounded by j (as after reaching phase 
jj p^ < Ij, hence sets of size ^ will not be included in D[). Thus the total 
communication cost due to dI steps is 

log n oo . 

Ck = pt ^ Yy < ■ 

t^Dl 

Finally consider (7f . We argue that each round t G can be matched with 
a distinct prior resignation step in the same phase, r(t) G Uk- Specifically, r(t) 
is the largest round f < t satisfying f G Uk and pt^ — -y- Note that Uk must 
contain such a round f , since by definition of I?|, there exists some round 

G with pt^f = pt. On round + 1, Hence, the fact that 

by time t the size of at went back to pt^f implies that there must have been 
an intermediate step f < t in which the algorithm performs an up step 
bringing at back to size py/. Note also that by the definition, r(t) is unique for 
every t. It follows that 

Cl = 2 J2pt = ^Pr(t) < = 4nf . 

t^Di t^Di neUk 

It follows from the preceding three bounds that Ck < . Subse- 

quently, 

log n log n oo 

(7“» = ^(7fc < + + = 4n + 6n«. | 

k=0 k=0 

Lemma 12. When hP > n, < 2{M2 - rP), 

Proof. The communication incurred by the final SiNGL stage for increasing the 
bid from n to + 1 is 0{M2 — n). As each participant resigns at most once. 
Algorithm VSS also incurs (at most) n — n^ — I resignations during that stage, 
one for each of the n — rP remaining active participants except for the winner. 

I 

Corollary 3. For any auction system $A$, $C_{VSS}(A) \le 2M_2 + 8n$. 

Proof For M 2 < n, CYgg(Al) = = 4n + by Lemma 11. For the case 

of M 2 > n, we have in addition to that also a cost of = 2 M 2 — 2n^. In total, 
CySS (^) ^ 2 M 2 + 4n + < 2 M 2 + 8n. | 

Theorem 1. For any auction system $A$, Algorithm VSS achieves 
asymptotically optimal complexities $T_{VSS}(A) = O(M_2)$ and 
$C_{VSS}(A) = O(n + M_2)$. ∎ 




Distributed Algorithms for English Auctions 






3.5 Allowing dynamic participants 

One of the limitations of Algorithm VSS is that it requires all the participants to 
register in advance, before the auction begins (as the algorithm takes the 
number of participants into account when deciding its querying policy). In actual 
computerized auctions, it is desirable to allow new participants to join in, so long 
as the auction has not terminated. In this section we extend our framework by 
allowing newcomers to join the auction. Assume there is some sort of bulletin 
board on which the process announces the auction and the current bid, and also 
that it has a mailbox in which it may receive requests to join the auction. 

For simplicity, the process will grant these requests only at the end of an 
auction (when only one participant is left). Thus the entire auction process is 
composed of a sequence of auctions, viewed as sub-auctions of the entire auction, 
each starting upon termination of the previous one, until no further requests to 
join arrive. 

More specifically, the auctioneer acts as follows. It starts by running a first 
sub-auction on the initial set of participants, This sub-auction is run until 
all participants but one have resigned, and the current bid is Bi — + 1 , 

made by participant . The auctioneer now opens the mailbox and reviews the 
requests to join. Let denote the set of newcomers asking to join the auction. 
The auctioneer now starts a sub-auction on V‘^ U starting from the initial 

bid El + 1, rather than 1 . (This may well be transparent to v^^ itself, i.e., there 
is no need for it to know that the first sub-auction has finished and a new sub- 
auction has begun.) This second sub- auction terminates with some participant 

making a bid of B 2 = Af| + 1 , where Af| is defined as the second highest 
valuation on U Once this sub-auction has terminated, the auctioneer 

again inspects its mailbox, and if there are additional newcomers then it repeats 
the process. Hence the entire auction process terminates only once a sub-auction 
ends and no additional requests to join have arrived, i.e., the mailbox is empty. 
Evidently, the execution may be quite long (or even infinite, assuming the value 
of the sold item keeps rising!). In practice, however, the number of rounds in 
the auction will be bounded by the value of the sold item, which is presumably 
stable over short periods of time. 

Formally, the extended auction system can be represented by an initial auc- 
tion system Ai = and a sequence of extensions^ M 2 , M 3 , . . . , Mp. An 

extension represents an entry point of new participants, described by a pair 
M^ = where n^ is the number of newcomers at the entry point, denoted 

— {ui,...,u^J, and /3^ is their valuation function. We also denote by M| 
the second highest valuation among the newcomers in the ith extension and the 
winner of the previous sub-auction. Note that the initial bid for the (i + l)st 
sub- auction is B^^l = + 1. Note also that is the second highest valuation 

over VK 

To handle this type of extended auction system we present Algorithm 
Extended Varying Size Scheme (EVSS), which acts as follows. The initial 
sub-auction is handled the same as in Algorithm VSS. Each of the following 
sub-auctions is again managed by a variant of Algorithm VSS. As explained above, 
this variant starts the bidding process at the highest bid of the previous 
sub-auction. However, the algorithm must shift the bids by $B$ for the purpose of 
computing the size $\rho_t$ of the query set in each round $t$. That is, phase $k$ of the 
algorithm must now be defined as the phase during which the bid increases from 
$B + 2^{k-1}$ to $B + 2^k$ (rather than as the phase during which the bid is doubled 
from $2^{k-1}$ to $2^k$). 
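
The shifted phase definition can be illustrated with a small sketch. It assumes integer bids and that phase $k$ covers the bids in $(B + 2^{k-1}, B + 2^k]$, with phase 0 holding the single bid $B + 1$. The function name `phase_of_bid` and this boundary convention are illustrative choices, not part of the algorithm's specification.

```python
def phase_of_bid(bid: int, base: int = 0) -> int:
    """Return the phase k with base + 2**(k-1) < bid <= base + 2**k.

    `base` plays the role of B, the closing bid of the previous
    sub-auction (hypothetical parameter name); base = 0 gives the
    unshifted phases of plain Algorithm VSS.
    """
    offset = bid - base
    if offset < 1:
        raise ValueError("bid must exceed the base bid")
    # For offset >= 1, (offset - 1).bit_length() equals ceil(log2(offset)),
    # computed in exact integer arithmetic.
    return (offset - 1).bit_length()
```

For instance, with a previous closing bid of 100, the bids 103 and 104 both fall in phase 2, exactly as the bids 3 and 4 would in the first sub-auction.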

It is straightforward to verify that the analysis of Algorithm VSS in the 
previous section goes through, and guarantees the following. 

Lemma 13. For i > 1, the ith sub-auction requires ^gYgg(Al^) = 0(M| — 
(denoting M| = 0) and CevSS(^0 = 0{M^ - + m). 

For an extended auction system A — (Ali, . . . , Alp), let n — 
define as above. Recall that Aff is the second highest valuation over the set 
of all participants throughout the entire auction process, IJf=i * We have the 
following. 

Theorem 2. For any extended auction system A — (Ali, . . . , Alp), r^vss(^) ~ 
0(Mf) and CevSS(^) = 0{n + Mf). 

Let us remark that a more flexible variant of Algorithm EVSS, allowing 
various entry points in the middle of a sub-auction, rather than only at the 
end of each sub-auction, is described in [2], and is shown to enjoy the same 
asymptotic complexities. 
Abstract. This paper presents a scalable leader election protocol for 
large process groups with a weak membership requirement. The under- 
lying network is assumed to be unreliable but characterized by proba- 
bilistic failure rates of processes and message deliveries. The protocol 
trades correctness for scale, that is, it provides very good probabilistic 
guarantees on correct termination in the sense of the classical specifi- 
cation of the election problem, and of generating a constant number of 
messages, both independent of group size. After formally specifying the 
probabilistic properties, we describe the protocol in detail. Our subse- 
quent mathematical analysis provides probabilistic bounds on the com- 
plexity of the protocol. Finally, the results of simulation show that the 
performance of the protocol is satisfactory. 

Keywords: Asynchronous networks, process groups, leader election, 
fault-tolerance, scalable protocols, randomized protocols. 



1 Introduction 

Computer networks are plagued by crashing machines, message loss, network 
partitioning, etc., and these problems are aggravated with increasing size of the 
network. As such, several protocol specifications are difficult, if not impossible, 
to solve over large-scale networks. The specifications of these protocols, which in- 
clude reliable multicast, leader election, mutual exclusion, and virtual synchrony, 
require giving strong deterministic correctness guarantees to applications. How- 
ever, in results stemming from the famous Impossibility of Consensus proof by 
Fischer-Lynch-Paterson [8], most of these problems have been proved to be 
unsolvable in failure-prone asynchronous networks. Probabilistic and randomized 
methodologies are increasingly being used to counter this unreliability by reduc- 
ing strict correctness guarantees to probabilistic ones, and gaining scalability in 
return. A good example of such a protocol is the Bimodal Multicast protocol 
[1], an epidemic protocol that provides only a high probability of multicast de- 
livery to group members. In exchange, the protocol gains scalability, delivering 
messages at a steady rate even for large group sizes. 

^ This work was funded by DARPA/RADC grant F30602-99-1-6532 and in part by 
the NSF grant No. EIA 97-03470. 

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 89-103, 2000. 
Fig. 1. Target Setting. 



Our current work is targeted toward realizing similar goals for the important 
class of protocols that classically have been formulated over reliable multicast 
message delivery. We envision a world where applications would run over a new 
class of probabilistic protocols (Figure 1) and receive probabilistically guaran- 
teed services from the layer below. By virtue of the proposed approach, these 
applications would scale arbitrarily, while guaranteeing correctness with a cer- 
tain minimal probability even in the face of an unreliable network. For example, 
these protocols could be used to build a replicated file system with probabilistic 
guarantees on consistency. 

As a step towards this goal, this paper presents a probabilistic leader election 
protocol. Leader election arises in settings ranging from locking and synchroniza- 
tion to load balancing [12] and maintaining membership in virtually synchronous 
executions [13]. The classical specification of the leader election problem for a 
process group states that at the termination of the protocol, exactly one non- 
faulty group member is elected as the leader, and every other non-faulty member 
in the group knows about this choice. In this paper, we show that, given prob- 
ability guarantees on point-to-point (unicast) and multicast message delivery, 
process failure rates, and multicast group view content, our protocol gives a 
very high probability of correct termination. In return, it gains in scalability: 
with very high probability, the protocol involves only a constant number of 
messages regardless of group size. We also show how to augment our protocol 
to adapt to changing failure probabilities of the network (w.r.t. processes and 
messages). 

Sabel and Marzullo [20] proved that leader election over a failure-prone 
asynchronous network is impossible. This and a variety of other impossibility results 
all stem from the FLP result [8], which proves that there is no protocol by which 
an asynchronous system of processes can agree on a binary value, even with only 
one faulty process. 

To provide a taxonomy of the complexity of the class of consensus protocols, 
Chandra and Toueg [4] proposed extending the network with failure detectors. 




A Probabilistic Correct Leader Election Protocol 






However, the leader election problem can be solved if and only if a perfect 
failure detector is available - one that suspects no alive processes, and eventually 
suspects every faulty one [20]. [6] discusses several weakened system models 
and what types of consensus are possible in these models, while [7] presents a 
weakened asynchronous model which assumes that message deliveries are always 
time-bounded. Since “real” systems lack such guarantees, these results have been 
valuable mostly in a theoretical rather than a practical sense. 

Non-randomized leader election algorithms for a failure-prone asynchronous 
network model broadly fall into the following flavors. 1) Gallager-Humblet-Spira- 
type algorithms [9, 17] that work by constructing several spanning trees in the 
network, with a prospective leader at the root of each of these, and recursively 
reduce the number of these spanning trees to one. The correctness guarantees 
of these algorithms are violated in the face of pathological process and message 
failures. 2) Models that create logical partitions in the network when commu- 
nication becomes unreliable, each logical partition electing one leader [7]. This 
approach does not solve the scalability problem but circumvents it. 3) Models 
that involve strong assumptions such as, for example, that all (process) failures 
occur before the election protocol starts [22], or that all messages are delivered 
reliably [3]. 

Probabilistic solutions to leader election in general networks are usually 
classified as randomized solutions to the consensus problem [5], but these focus on 
improving either the correctness guarantee [19], or the bound on the number 
of tolerated failures [23]. The (expected or worst case) message complexities in 
these algorithms are typically at least linear in the group size, and fault 
tolerance is usually guaranteed by tolerating process failures up to some fraction of 
the group size. Further, most of these protocols involve several rounds of $O(N)$ 
simultaneous multicasts to the group (where $N$ is the group size), and this can 
cause deterioration of the delivery performance of the underlying network. 

Our take on the leader election problem is in a more practical setting than 
any of the above cited works. We are motivated by practical considerations of 
scaling in a real network where failures can be characterized by probabilities. The 
spirit of our approach is close to that of [1] and [24]. Our protocol's probabilistic 
guarantees are similar to those of leader election algorithms for the perfect 
information model [14, 25], while our guarantee on the number of messages resembles 
that of [11], which presents an election protocol for anonymous rings. To the 
best of our knowledge, ours is the first protocol that trades correctness of the 
leader election problem for better scalability. 

The analysis and simulation of our protocol will assume a network model 
where process failures, and message delivery latencies and statistics have 
identical, independent and uniform distributions. Before doing so, however, we suggest 
that the leadership election algorithm proposed here belongs to a class of gossip 
protocols, such as Bimodal Multicast [1], where such a simplified approach leads 
to results applicable in the real world. Although the model used in [1], like the 
one presented here, seems simplified and unlikely to hold for more than some 
percentage of messages in the network, one finds that in real-world scenarios, 
even with tremendous rates of injected loss, delay, and long periods of correlated 
disruption, the protocol degrades gracefully in its probabilistic guarantees. 

The rest of the paper is organized as follows. Section 2 describes the assumed 
model and statement of the election problem we solve. Section 3 describes the 
protocol in detail. Section 4 analyzes the protocol mathematically, while Section 5 
presents simulation results. In Section 6, we present our conclusions. 

2 The Model and Problem 

2.1 Model 

In our model, all processes have unique identifiers (e.g., consisting of their host 
address and local process identifier). All processes that might be involved in 
the election are part of a group, which can have an arbitrarily large number of 
members. Each process has a possibly incomplete list of other members in the 
group, called the process' view. A process can communicate to another process 
in its view by ucast (unicast, point-to-point) messages, as well as to the entire 
group by mcast (multicast) messages. 

Processes and message deliveries are both unreliable. Processes can undergo 
only fail-stop failures, that is, a process halts and executes no further steps. 
Messages (either ucast or mcast) may not be delivered at some or all of the 
recipients. This is modeled by assuming that processes can crash with some 
probability during a protocol round and a ucast (mcast) message may not reach 
its recipient(s) with some probability. Probabilistically reliable multicast can be 
provided using an epidemic protocol such as Bimodal Multicast [1]. The Bimodal 
Multicast protocol guarantees a high probability of multicast message delivery 
to all group members in spite of failures by having each member periodically 
gossip undelivered multicast messages to a random subset of group members in 
its view. 

A few words on the weak group model are in order. As we define them, views 
do not need to be consistent across processes, hence a pessimistic yet scalable 
failure detection service such as the gossip heartbeat mechanism of [24] suffices. 
New processes can join the group by multicasting a message to it, and receiving 
a reply/state transfer from at least one member that included it in its view. 

Our analysis later in this paper assumes a uniform distribution for process 
failure probabilities ($p_{fail}$), ucast/mcast message delivery failure probabilities 
($p_{ucastl}$, $p_{mcastl}$), as well as the probability that a random member has another 
random member in its view, which we call the view probability (view-prob). 



2.2 Problem Statement 

An election is initiated by an mcast message. This might originate from, say, a 
client who wants to access a database managed by the group, or one or more 
member(s) detecting a failure of a service or even the previous leader. In our 
discussion, we will assume only one initiating message, but the extension of our 
protocol to several initiating messages is not too difficult. 
In classical leader election, after termination there is exactly one non-faulty 
process that has been elected leader, and all non-faulty processes know this 
choice. In probabilistic leader election, with known high probability, 

- (Uniqueness) there is exactly one non-faulty process that considers itself the 
leader; 

- (Agreement) all non-faulty group members know this leader; and 

- (Scale) a round of the protocol involves a total number of messages that can 
be bounded independent of the group size. 

3 Probabilistic Leader Election 

This section describes the proposed leader election protocol. The protocol 
consists of several rounds, and each round consists of three phases. In Section 3.1, 
we present the phases in a round. In Section 3.2, we describe the full protocol 
and present its pseudo-code. 

3.1 Phases in a Round of the Protocol 

Filter Phase We assume that the initiating mcast $I$ is uniquely identified by a 
bit string $A_I$. For example, $A_I$ could be the (source address, sequence number) 
pair of message $I$, or the (election #, round #) pair for this election round. Each 
group member $M_i$ that receives this message computes a hash of the 
concatenation of $A_I$ and $M_i$'s address, using a hash function $H$ that deterministically maps 
bit strings to the interval [0, 1]. Next, $M_i$ calculates the filter value $H(M_i A_I) \times N_i$ 
for the initiating message, where $N_i$ is the size of (number of members in) $M_i$'s 
current view. $M_i$ participates in the next phase of this round, called the Relay 
phase, if and only if this filter value is less than a constant $K$; otherwise it waits 
until the completion of the Relay phase. We require that $H$ and $K$ be the same 
for all members. We show in Section 4 that for a good (or fair) hash function 
$H$ and a large total number of group members $N$, the probability that the number of 
members throughout the Relay phase lies in an interval near $K$ quickly goes to 
unity at small values of $K$. This convergence is independent of $N$ and is 
dependent only on the process failure, message delivery failure and view probabilities. 

If the $N_i$'s are the same for all members, each member $M_i$ can calculate the 
set of members $\{M_j\}_i$ in its view that will satisfy the filter condition. It does so 
by checking if $H(M_j A_I) \times N_i < K$ for each member $M_j$ in its view. In practice, 
the $N_i$'s may differ, but this will not cause the calculated set $\{M_j\}_i$ to differ 
much from the actual one. (A more practical approach is to use an approximation 
of the total number of group members for $N_i$. This can be achieved by gossiping 
this number throughout the group. Thus, the $N_i$'s of different members will be close 
and the above filter value calculation will be approximately consistent.) 
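
The filter condition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the protocol's actual implementation: the text leaves the hash function $H$ unspecified, so SHA-256 truncated to 53 bits stands in for a fair hash, and the `"|"` separator, the string identifiers and the function names are all hypothetical.

```python
import hashlib


def fair_hash(data: bytes) -> float:
    """Map a byte string deterministically to [0, 1).

    SHA-256 is an assumed stand-in for the hash function H.
    """
    digest = hashlib.sha256(data).digest()
    # Keep 53 bits so the quotient is an exact double strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def passes_filter(member: str, msg_id: str, view_size: int, k: float) -> bool:
    """Filter condition H(M A_I) x N_i < K for one member M."""
    value = fair_hash((member + "|" + msg_id).encode()) * view_size
    return value < k


def expected_relay_set(view, msg_id: str, view_size: int, k: float):
    """Members of `view` that would pass the filter for message `msg_id`."""
    return [m for m in view if passes_filter(m, msg_id, view_size, k)]
```

Because every member evaluates the same deterministic function, any two members with equal view sizes compute the same relay set for a given initiating message, and with a fair hash the expected size of that set is about $K$ regardless of the group size.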

Figure 2 shows an illustration of one protocol round. The initiating multicast 
I is multicast to the entire group, but some group members may not receive it 
since mcast delivery is unreliable. The ones who do receive it evaluate the filter 
condition in the next step. The members labeled with solid circles (2, 3, N) find 
this condition to be true and hence participate in the Relay phase. 







I. Gupta, R. van Renesse, and K.P. Birman 




Fig. 2. One (Successful) Protocol Round. 



Relay Phase As explained earlier, a member $M_i$ that has passed the filter and 
is participating in the Relay phase can calculate the subset of members $\{M_j\}_i$ 
in its view that would have passed the filter condition if they received $I$. In 
the Relay phase, $M_i$ first sends ucast messages to all such members in the set 
$\{M_j\}_i$ specifying $M_i$'s preferred choice for a leader from among its view members. 
This choice is determined by the ordering generated by a choice function which 
evaluates the advantages of a particular member being elected leader. We require 
no restriction on the particular choice functions used, although all members need 
to use the same choice function in evaluating their preferred leaders, breaking 
ties by choosing the process with a lower identity. A good choice function would 
account for load, location, network topology, etc. [21]. 

Second, whenever $M_i$ is contacted by another member $M_k$ in a similar 
manner, it includes $M_k$ in its view (and adds it to $\{M_j\}_i$) and compares $M_k$'s choice 
with its own. If $M_k$'s choice is "better" than its own according to the choice 
function, $M_i$ relays this new choice to all the members in the set $\{M_j\}_i$ by ucast 
messages, and replaces its current best choice for leader. Otherwise, $M_i$ replies 
back to $M_k$ specifying its current best leader choice. 

In the example of Figure 2, the 2nd, 3rd and Nth group members enter the 
Relay phase, but the 2nd member subsequently fails. If either of the 3rd and 
the Nth members has the other in its view, they will be able to exchange relay 
messages regarding the best leader. 

Consider the undirected graph with nodes defined by the set of members 
participating in the Relay phase (relay members), and an edge between two 
members if and only if at least one of them has the other in its view throughout 
the phase. We call this the relay graph. Assuming timely message deliveries and 
no process failures, each connected component of this graph will elect exactly 
one leader, with a number of (ucast) messages dependent only on the size of 
the component. In Sections 4 and 5, we show that for a good hash function, the 
likelihood of the relay graph having exactly one component (and thus electing 
exactly one leader in the Relay phase) approaches unity quickly at low values of 
$K$. Further, this convergence is independent of $N$ and is dependent only on the 
process failure, message delivery failure and view probabilities. In Section 5, for 
an example choice function that is widely used in many distributed systems, we 
show that message delivery and process failures do not affect this convergence. 
Note that the number of ucast messages exchanged in a Relay phase with $m$ 
members is $O(m^2)$, since each relay member's best choice might be 
communicated to every other relay member. 
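
The relay graph and the claim that each connected component settles on one leader can be made concrete with a short sketch. It assumes timely delivery and no failures, as in the text, and it further assumes that the candidates of a component are its members plus everything in their views. The names `relay_components`, `leaders_per_component` and `rank` are illustrative, and `rank` stands in for the unspecified choice function (lower is better, ties go to the lower identifier).

```python
def relay_components(views):
    """Connected components of the relay graph.

    `views` maps each relay member to the set of members in its view.
    Two relay members are adjacent if at least one has the other in its view.
    """
    adjacent = {member: set() for member in views}
    for member, view in views.items():
        for other in view:
            if other in adjacent and other != member:
                adjacent[member].add(other)
                adjacent[other].add(member)

    components, seen = [], set()
    for start in sorted(adjacent):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacent[node] - component)
        seen |= component
        components.append(component)
    return components


def leaders_per_component(views, rank):
    """Leader each component converges to under the assumptions above."""
    leaders = []
    for component in relay_components(views):
        candidates = set(component)
        for member in component:
            candidates |= views[member]
        leaders.append(min(candidates, key=lambda m: (rank(m), m)))
    return leaders
```

When the returned list has more than one entry, the round has produced several leaders, which is the inconsistency that the final multicasts are meant to expose to the rest of the group.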

Finally, at the end of the Relay phase, when each component has decided 
on one leader, each member $M_i$ participating in the Relay phase multicasts the 
identifier of the leader selected by $M_i$'s component ($M_i$'s current best choice) to 
the entire group; this is the set of final multicasts of this election round. The 
total number of multicast messages in the Relay phase is thus $O(m)$. Since it is 
likely that $m$ lies in an interval near the protocol parameter $K$, which is chosen 
regardless of $N$ (analysis of Section 4), this implies only a constant number of 
ucast and mcast messages in the Relay phase with high probability, regardless 
of the value of $N$. 

In the example of Figure 2, once the 3rd and Nth members have agreed on 
a leader, each of them multicasts this information to the group. Some of the 
group members may not receive both multicasts, but it is unlikely that every 
non-faulty member will receive neither. 



Failure Detection Phase Consider a situation in which there is more than one 
connected component in the relay graph. Each of these components may select 
and multicast different leaders in the Relay phase. Having each Relay phase 
member broadcast its component’s selected leader to the entire group using a 
probabilistic multicast mechanism (such as Bimodal Multicast [1]) would give 
us a high probability that this inconsistency is detected by some group member 
(which need not have participated in the Relay phase). If a member detects 
an inconsistency such as two leaders elected in the same round, it immediately 
sends out a multicast to the entire group re-initiating the next election round. 
If no member detects any such inconsistency, the election protocol round would 
satisfy the Uniqueness and Agreement conditions of Section 2.2 if and only if 
there was exactly one component in the Relay phase, this component selected 
exactly one leader, every other non-faulty group member received at least one 
of the multicast messages specifying this selected leader, and this elected leader 
did not fail during the election round. 

To reduce the probability of many group members sending out a (re-)initiating 
multicast message at the same time, we could have each member $M_i$ calculate the 
hash (using $H$) of its own id concatenated with the message identifier of one of 
the resulting messages, and send out a re-initiating multicast only if this is lower 
than $K/N_i$. This would again give an expected constant number of re-initiating 
multicasts. Alternatively, we could use a randomized delay before sending the 
request: if a process receives a re-initiation request, it need not send one of its 
own. 







Member M_i :: Election(Sequence, RoundNum): 

1. On receiving "Init election" message I specifying (Sequence, RoundNum), 
     select K from RoundNum using strategy 
     if H(M_i A_I) x N_i < K, go to step 2 
     else wait for timeout period Time-Out-1 (time for step 2 to complete) and jump to step 3 

2. Find the set of members {M_j}_i in my view such that H(M_j A_I) x N_i < K 
   find best preferred leader in my view and send this using ucast messages to members in {M_j}_i 
   do until Time-Out-2 
     receive similar preferred leader messages for this (Sequence, RoundNum) from 
       other members M_k 
     include M_k in {M_j}_i and M_i's view 
     compare current best leader choice with M_k's preference (using choice function) 
     if M_k's preference better, 
       update current best leader choice and send ucast messages to all members in {M_j}_i 
         specifying this 
     else 
       inform M_k using a ucast of M_i's current best choice 
   wait Time-Out-3 to receive everyone's final leader choice. 

3. if received none or more than one leader as final choice, 
     choose one of the final choice messages F 
     if H(M_i A_F) x N_i < K, 
       multicast an initiating message specifying (Sequence, RoundNum + 1) 
     wait for Time-Out-3, increment RoundNum and jump to step 1 
   if no re-initiating mcast received within another Time-Out-3, 
     declare received choice as elected leader and include it in M_i's view 
   else increment RoundNum and jump to step 1. 



Fig. 3. The Complete Election Protocol. 



3.2 General Protocol Strategy 

Figure 3 contains the pseudo-code for the steps executed by a group member 
during a complete election protocol, each distinct election specified by a unique 
sequence number SequenceNum. Our complete election protocol strategy is to 
use the election round described in the previous section as a building block 
in constructing a protocol with several rounds. A complete protocol strategy 
specifies 1) the value of $K$ to be used (by each member) in the first round of 
each election, 2) the value of $K$ to be used in round number $l + 1$ as a function of 
the value of $K$ used in round $l$, and 3) a maximum number of rounds after which 
the protocol is aborted. Note that this strategy is deterministic and known to 
all members, and is not decided dynamically. In Figure 3, RoundNum refers to 
the current round number in this election. M_i :: Election(SequenceNum, 1) is 
called by $M_i$ on receipt of the initiating message for that election protocol. 

As we will see in Section 4, the initial value of $K$ can be calculated from the 
required protocol round success probability, view probabilities, process and 
message delivery failure probabilities for the network in which the group members 
are based, and the total maximum number of group members. Unfortunately, in 
practice, failure probabilities may vary over time. Since a higher value of $K$ leads 
to a higher probability of success in a round (Section 4), we conclude that round 
$l + 1$ must use a higher value of $K$ than round $l$. For example, one could use twice 
the value of $K$ in round $l$ for round $l + 1$. This class of strategies makes our leader 
election protocol adaptive to the unpredictability of the network. Note that a 
low maximum number of protocol rounds implies fewer expected messages while 
a higher value results in a better probability of correct termination. 
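
The three parts of a protocol strategy can be captured as a precomputed schedule of $K$ values. The sketch below uses the doubling rule mentioned above as its default; the function name `k_schedule` and the `growth` parameter are hypothetical, and any rule that strictly increases $K$ from one round to the next would serve equally well.

```python
def k_schedule(initial_k: float, max_rounds: int, growth: float = 2.0) -> list:
    """Values of K for rounds 1..max_rounds of one election.

    initial_k  -- K used in the first round
    growth     -- factor applied from round l to round l + 1 (must exceed 1)
    max_rounds -- number of rounds after which the protocol is aborted
    """
    if max_rounds < 1:
        raise ValueError("at least one round is required")
    if growth <= 1.0:
        raise ValueError("K must increase from round to round")
    return [initial_k * growth**r for r in range(max_rounds)]
```

Since the schedule depends only on the round number, every member derives the same value of $K$ for a given round without any coordination, which matches the requirement that the strategy be deterministic and known to all members.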

The pseudo-code of Figure 3 has the members using time-outs (the Time-Out-* 
values) to detect (or, rather, estimate) completion of a particular part of the 
protocol round in an asynchronous network.^ Time-Out-2 is the expected time for 
termination of the Relay phase (before the final multicasts at the end). This is 
just the worst case propagation and processing delay needed for a message 
containing a relay member's initial preferred leader to reach all other relay members 
(if it is not lost by a process or link failure). Although the number of relay 
members is not known a priori, we show in Section 4 that with known high probability, 
the number of relay members who do not fail until the end of the Relay phase 
is at most $3K/2$. Thus Time-Out-2 can be taken to be the product of $3K/2$ 
(the maximum length of any path in a relay graph with $3K/2$ members) and 
the maximum propagation and processing delay for a ucast packet in the 
underlying network. Time-Out-3 is just the worst case time for delivery of a mcast 
message. In the Bimodal Multicast protocol [1], this would be the maximum 
time a message is buffered anywhere in the group. Time-Out-1 is the sum of the 
maximum time needed at member $M_i$ to calculate the set $\{M_j\}_i$ and the values 
of Time-Out-2 and Time-Out-3. Also, a member ignores any messages from 
previous protocol rounds or phases, and "jumps ahead" on receiving a message 
from a future protocol round or phase. 
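
The relationship between the three time-outs can be sketched as a small helper. The parameter names are hypothetical, and the delay bounds are assumed to be supplied by whoever deploys the protocol, since the text does not say how they are measured.

```python
def round_timeouts(k, max_ucast_delay, max_mcast_delay, max_set_calc_time):
    """Return (Time-Out-1, Time-Out-2, Time-Out-3) for one protocol round.

    k                  -- filter parameter K of this round
    max_ucast_delay    -- worst-case propagation and processing delay of a ucast
    max_mcast_delay    -- worst-case delivery time of a mcast
    max_set_calc_time  -- worst-case time for a member to compute {M_j}_i
    """
    # Longest relay path: at most 3K/2 surviving relay members (Section 4).
    time_out_2 = (3 * k / 2) * max_ucast_delay
    time_out_3 = max_mcast_delay
    time_out_1 = max_set_calc_time + time_out_2 + time_out_3
    return time_out_1, time_out_2, time_out_3
```

Note that Time-Out-2 grows linearly with $K$, so a strategy that raises $K$ in later rounds also lengthens those rounds.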



4 Analysis - Properties of the Protocol 

In this section, we summarize the analysis of the probability of success, the 
detection of incorrect termination, and the message and time complexity of a round of 
our protocol. Detailed discussions and proofs of the results are available in [10]. 

Let N be the number of group members at the start of the election round - we
will assume that this value is approximately known to all group members so that
the filter value calculation is consistent across members. Let view-prob be the
probability of any member $M_i$ having any other member in its view
throughout the election round. Let $p_{fail}$, $p_{ucastl}$, $p_{mcastl}$ be the failure probabilities of a
process during an election round, a ucast message delivery, and a mcast message
delivery, respectively. The protocol round analyzed uses the parameter K as in
Figure 2. We denote the terms {{l—pmcasti)'^) P fail)' Pmcasti) 

as $K_1$ and $K_2$ respectively.

For simplicity, we assume that the probabilities of deliveries of a mcast
message at different receivers are independent, and that the Time_Out_* values
in the protocol of Figure 2 are large enough. Our analysis can be modified to
omit the latter assumption by estimating the Time_Out_* values from K and the
worst-case propagation and processing delays for ucast and mcast messages (as
described in Section 3.2), and redefining $p_{ucastl}$ ($p_{mcastl}$) to be the failure
probability of a ucast (mcast) message delivery within the corresponding worst-case
delay, as well as calculating view-prob and $p_{fail}$ for a round duration. We also
assume that the hash function H used is fair, that is, it distributes its outputs
uniformly in the interval [0, 1]. For a particular hash function (e.g., the one
described in [15]), we would need to know its distribution function and plug it into
a similar analysis.

¹ Although an asynchronous network model does not admit real time, in practice
timers are readily available, and we do not assume any synchronization nor much in
the way of accuracy in measuring intervals.

Consider the following events in an election round with parameter K: 

E1: between $K_1/2$ and $3K_1/2$ members are chosen to participate in the Relay
phase, and between $K_2/2$ and $3K_2/2$ relay members do not fail before sending
out the final multicast;

E2: the set of relay members who do not fail before their final multicasts form a 
connected component in the relay graph throughout the Relay phase; 

E3: at the end of the Relay phase, each non-faulty relay member has selected 
the same leader; 

E4: by the end of the election round, each group member either fails or receives 
at least one of the final multicast messages (specifying the selected leader) from 
each component in the relay graph at the end of the Relay phase; 

E5: the elected leader does not fail. 

Theorem 1 (Round success probabilities):

(a) The event [E3, E4, E5] in an election round in the protocol of Figure 2 implies
that it is successful, that is, the election satisfies the Uniqueness and Agreement
properties of Section 2.2.

(b) From [2, 16], the probability of success in an election round with parameter
K can be lower bounded by

$\Pr[E1, E2, E3, E4, E5]$

$= \Pr[E1] \cdot \Pr[E2 \mid E1] \cdot \Pr[E3 \mid E1, E2] \cdot \Pr[E4 \mid E1, E2, E3] \cdot \Pr[E5 \mid E1, E2, E3, E4]$

> (1 ^ ^ 

'{{Pfail + (1 ~ Pfail)i^ ~ Prncastl^P^ ' C ~ Pfail) 



□ 

Figure 4 shows the typical variation of the lower bounds (subscript lb stands
for “lower bound”) of the first four product terms and $\Pr_{lb}[E1, E2, E3, E4, E5]$,
for values of K up to 65, with (view-prob, $p_{mcastl}$, $p_{ucastl}$, $p_{fail}$, N)
= (0.4, 0.01, 0.01, 0.01, 10000). The quick convergence of $\Pr_{lb}[E1]$ and $\Pr_{lb}[E2 \mid E1]$
to unity at small K (here, around 40) is independent of the value of N. In fact,
$\Pr_{lb}[E4 \mid E1, E2, E3]$ is the only one among the five factors of $\Pr_{lb}[E1, E2, E3, E4, E5]$
that seems to depend on N. However, its value remains close to unity as long as
N is much smaller than a bound determined by $p_{mcastl}$ and K which, for K = 40,
turns out to be a number beyond the size of most practical process groups.




Fig. 4. Pessimistic Analysis of Success Probability of one round of our Leader Election 
Protocol. 



Thus, for all practical values of initial group size N, the minimum probability
that an election round of the protocol of Figure 2 satisfies the Uniqueness and
Agreement conditions depends only on the failure and view probabilities in
the group, and is independent of N.

From Figure 4, this minimal protocol round success probability appears to
peak at 0.6 for the above parameters. This is because our estimate for
$\Pr_{lb}[E3 \mid E1, E2]$ is very pessimistic in assuming a weak global view knowledge,
and thus including the possibility that all the initially preferred leaders in the
Relay phase might be distinct. In a practical setting, however, a fair number of
the m (non-faulty) relay members would have the same initial leader choices (e.g.,
if the choice function preferred a candidate leader with lower identity), so the
probability $\Pr_{lb}[E3 \mid E1, E2]$ (and hence $\Pr_{lb}[E1, E2, E3, E4, E5]$) would be much
higher than the curve shows. The simulation results in Section 5 confirm this for
the choice function mentioned above.

Theorem 2 (Detection of incorrect termination in a round): Pr[ a re-initiating
mcast is sent out to the group or all group members fail by the end
of the round | election round with parameter K does not succeed ] is bounded
below by Prib[El] • (1 ^ (1 ^ (1 -p/a«)(l - 

Note that, with K fixed so that the term $\Pr_{lb}[E1]$ is arbitrarily close to unity,
the probability of detection of incorrect election in a round of the presented
protocol goes to unity as N tends to infinity. □

Theorem 3 (Round message complexity): With (high) probability $\Pr_{lb}[E1]$
((Prib[El])^), the number of ucast (mcast) messages in an election round is 
(0(i^)) (since Ki^ K 2 are both 0{K)). Also, with (high) probability 
(Prib[El])^, the number of simultaneous multicasts in the network anytime dur- 
ing the round is 0{K). □ 

Theorem 4 (Round message complexity); Further, the expected number of 
ucast (mcast) messages in a round of the protocol is 0{K^) {0{K)). This is 
0(1) when K is fixed independent of N. The suggested election protocol round 
thus achieves the optimal expected message complexity for any global agreement 
protocol on a group of size N. □ 

Theorem 5 (Round time complexity); With (high) probability (Prib[El])^, 
the time complexity of an election round is $O(MK + N)$ for a group of size N
over a network with M nodes. This is $O(M + N)$ for K fixed independent of N,
which is the optimal time complexity for any global agreement protocol. □

5 Simulation Results 

In this section, we analyze, through simulation, the performance of an election
protocol strategy from the class described in Section 3.2. The correctness,
scalability and fault tolerance of the proposed protocol are more evident here than
from the pessimistic analysis of Section 4. The strategy we analyze is specified
by 1) an initial (first round) parameter K = 7; 2) for $l \le 4$, the value of K in
round $l$ is twice the value used in round $l - 1$; and at $l = 5$, K = N; and finally
3) the election protocol aborts after 5 rounds. The protocol is initiated by one
mcast to the group, which initially has N members.
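The per-round values of K under this strategy can be tabulated with a few lines of code. This is only an illustrative sketch; the function name and its default arguments are assumptions, chosen to mirror the strategy stated above (start at 7, double each round, use N in the last of 5 rounds).

```python
def k_schedule(n, k_initial=7, max_rounds=5):
    """Return the value of K used in rounds 1..max_rounds for a group of size n."""
    schedule = []
    k = k_initial
    for round_no in range(1, max_rounds + 1):
        if round_no == max_rounds:
            # Last allowed round: K is set to the group size N.
            schedule.append(n)
        else:
            schedule.append(k)
            k *= 2  # the next round uses twice the current K
    return schedule
```

For a group of N = 2000 members this yields the sequence 7, 14, 28, 56, 2000, after which the protocol aborts.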

The unreliability of the underlying network and process group mechanism
is characterized by the parameters $p_{ucastl}$, $p_{mcastl}$, $p_{fail}$, view-prob as defined in
Section 4. The hash function is assumed to be a fair one. The choice function used
in the simulation is the simple one that prefers candidates with lower identities.

The metrics used to measure the performance of the protocol are the fol- 
lowing. P (Success) evaluates the final success probability of the protocol, and 
appears in two forms. “Strong” success probability refers to the (average) prob- 
ability that a protocol run satisfies the Uniqueness and Agreement conditions. 
“Weak” success probability is in fact the (average) majority fraction of the non- 
faulty group members that agree on one leader at the end of the protocol. This 
is a useful metric for situations where electing more than one leader may be al- 
lowed, such as [18]. # Rounds refers to the average number of rounds after which 
the protocol terminates, either successfully, or without detecting an inconsistent 
election, or because the maximum number of rounds specified by the strategy 
has been reached. # Messages refers to the average number of ucast and mcast 
messages generated in the network during the protocol. 
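To make the two forms of P(Success) concrete, the sketch below computes them for a single protocol run. It is an illustration under stated assumptions, not the simulator used for Figure 5: the input is assumed to be the list of leader identities held by the non-faulty members at the end of the run (None for a member without a leader), and "strong" success is approximated as all of them naming the same leader. The reported metrics would then be averages of these per-run values.

```python
from collections import Counter


def strong_success(choices):
    """True if every non-faulty member ends the run with the same leader."""
    return bool(choices) and choices[0] is not None and all(
        c == choices[0] for c in choices
    )


def weak_success(choices):
    """Largest fraction of non-faulty members that agree on a single leader."""
    counts = Counter(c for c in choices if c is not None)
    if not counts:
        return 0.0
    # most_common(1) returns [(leader, number_of_members_naming_it)]
    return counts.most_common(1)[0][1] / len(choices)
```

A run in which three of four non-faulty members elect member 3 and the fourth elects member 5 is a strong failure but has a weak success value of 0.75.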

Figure 5 shows the results from the simulations. This figure is organized with
each column of graphs indicating the variation of a particular performance metric
as a function of each of the system parameters, and each row of graphs showing
the effect of varying a system parameter on each of the performance metrics.
Each point on these plots is the average of results obtained from 1000 runs of
the protocol with the specified parameters. In Figures 5(a-c), $p_{ucastl} = p_{mcastl}$
is varied in the range [0, 0.5] for fixed N = 2000, $p_{fail}$ = 0.001, view-prob = 0.5.
The graphs for varying $p_{fail}$ are very similar and not included here. In Figures
5(d-f), N is varied in the range [1000, 5000] for fixed $p_{fail}$ = 0.001, fixed view-prob,
and $p_{ucastl} = p_{mcastl}$ = 0.001. In Figures 5(g-i), view-prob is varied over a range
starting at 0.2, for N = 5000 and the process and message failure probabilities
fixed at 0.001.

Figures 5(a,d,g) show the very high success probability (strong) guaranteed
by the above strategy even in the face of high message loss rates (up to $p_{ucastl} =
p_{mcastl}$ = 0.4, up to and beyond N = 6000 and view-prob = 0.2). Notice that
even the “weak” success ratio is close to 1 for these ranges, and as expected,
is higher than the strong success probability. Figures 5(b,e,h) show the time
scalability of the protocol for the same ranges of parameters that produced
high success probabilities. Note Figure 5(e), which shows termination within 1
expected round for values of N up to 6000 (!) group members. Figures 5(c,f,i)
show the message scalability for the same variation of parameters. Note again
the lack of variation in the expected number of messages exchanged (Figure 5(f))
as N is varied up to 6000 members.

Figures 5(a-c) display the level of fault tolerance the protocol possesses with
respect to message failures. Figures 5(d-f) show how well our protocol scales
even as the number of group members is increased into the thousands. Finally,
Figures 5(g-i) show that our protocol performs well even in the presence of only
partial membership information at each member.

6 Conclusions 

This paper described a novel leader election protocol that is scalable, but pro- 
vides only a probabilistic guarantee on correct termination. Mathematical analy- 
sis and simulation results show that the protocol gives very good probabilities of 
correct termination, in the classical sense of the specification of leader election, 
even as the group size is increased into the tens of thousands. The protocol also 
(probabilistically) guarantees a low and almost constant message complexity in- 
dependent of this group size. Finally, all these guarantees are offered in the face 
of process and link failure probabilities in the underlying network, and with only 
a weak membership view requirement. 

The trade-off among the above guarantees is determined by one crucial protocol
parameter: the value of K in an election round. From the simulation results,
it is clear that choosing K to be a small number (although not very small)
suffices to provide acceptable guarantees for the specified parameters. Increasing
the value of K would enable the protocol to tolerate higher failure probabilities,
but would increase its message complexity. Varying K thus yields a trade-off
between increasing the fault tolerance and correctness probability guarantee on
one hand and lowering the message complexity on the other.

Fig. 5. Performance characteristics of our Leader Election Protocol. Panels include:
(a) P(Success) vs. message loss probability; (b) Avg. #Rounds vs. message loss
probability; (c) Avg. #Messages vs. message loss probability; (g) P(Success) vs. view
probability; (h) Avg. #Rounds vs. view probability; (i) Avg. #Messages vs. view
probability.



References 



1. K.P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, Y. Minsky, “Bimodal
multicast”, ACM Trans. Computer Systems, vol. 17, no. 2, May 1999, pp. 41-88.

2. B. Bollobas, A. Thomason, “Random graphs of small order”, Annals of Discrete
Mathematics, Random Graphs ’83, vol. 8, 1983, pp. 47-97.

3. J. Brunekreef, J.-P. Katoen, R. Koymans, S. Mauw, “Design and analysis of
dynamic leader election protocols in broadcast networks”, Distributed Computing,
vol. 9, no. 4, Mar 1997, pp. 157-171.












4. T.D. Chandra, S. Toueg, “Unreliable failure detectors for asynchronous systems”,
Proc. 10th Annual ACM Symp. Principles of Distributed Computing, 1991, pp.
325-340.

5. B. Chor, C. Dwork, “Randomization in Byzantine agreement”, Advances in
Computing Research, vol. 5, 1989, pp. 443-498.

6. D. Dolev, C. Dwork, L. Stockmeyer, “On the minimal synchronism needed for
distributed consensus”, JACM, vol. 34, no. 1, Jan 1987, pp. 77-97.

7. C. Fetzer, F. Cristian, “A highly available local leader election service”, IEEE
Trans. Software Engineering, vol. 25, no. 5, Sep-Oct 1999, pp. 603-618.

8. M.J. Fischer, N.A. Lynch, M.S. Paterson, “Impossibility of distributed consensus
with one faulty process”, Journ. of the ACM, vol. 32, no. 2, Apr 1985, pp. 374-382.

9. R. Gallager, P. Humblet, P. Spira, “A distributed algorithm for minimum weight
spanning trees”, ACM Trans. Programming Languages and Systems, vol. 4, no. 1,
Jan 1983, pp. 66-77.

10. I. Gupta, R. van Renesse, K.P. Birman, “A probabilistically correct leader
election protocol for large groups”, Computer Science Technical Report
ncstrl.cornell/TR2000-1794, Cornell University, U.S.A., Apr. 2000.

11. A. Itai, “On the computational power needed to elect a leader”, Lecture Notes in
Computer Science, vol. 486, 1991, pp. 29-40.

12. C.-T. King, T.B. Gendreau, L.M. Ni, “Reliable election in broadcast networks”,
Journ. Parallel and Distributed Computing, vol. 7, 1989, pp. 521-540.

13. C. Malloth, A. Schiper, “View synchronous communication in large scale
networks”, Proc. 2nd Open Workshop of the ESPRIT project BROADCAST, Jul 1995.

14. R. Ostrovsky, S. Rajagopalan, U. Vazirani, “Simple and efficient leader election in
the full information model”, Proc. 26th Annual ACM Symp. Theory of Computing,
1994, pp. 234-242.

15. O. Ozkasap, R. van Renesse, K.P. Birman, Z. Xiao, “Efficient buffering in reliable
multicast protocols”, Proc. 1st Intnl. Workshop on Networked Group Communication,
Nov. 1999, Lecture Notes in Computer Science, vol. 1736.

16. A. Papoulis, Probability, Random Variables, and Stochastic Processes,
McGraw-Hill International Edition, 3rd edition, 1991.

17. D. Peleg, “Time optimal leader election in general networks”, Journ. Parallel and
Distributed Computing, vol. 8, no. 1, Jan 1990, pp. 96-99.

18. R. De Prisco, B. Lampson, N. Lynch, “Revisiting the Paxos algorithm”, Proc.
11th Intnl. Workshop on Distributed Algorithms, 1997, Lecture Notes in Computer
Science, vol. 1320, pp. 111-125.

19. M.O. Rabin, “Randomized Byzantine generals”, Proc. 24th Annual Symp.
Foundations of Computer Science, Nov. 1983, pp. 403-409.

20. L.S. Sabel, K. Marzullo, “Election vs. consensus in asynchronous systems”,
Computer Science Technical Report ncstrl.cornell/TR95-1488, Cornell University,
U.S.A., 1995.

21. S. Singh, J.F. Kurose, “Electing good leaders”, Journ. Parallel and Distributed
Computing, vol. 21, no. 2, May 1994, pp. 184-201.

22. G. Taubenfeld, “Leader election in the presence of n-1 initial failures”, Information
Processing Letters, vol. 33, no. 1, Oct 1989, pp. 25-28.

23. S. Toueg, “Randomized Byzantine agreements”, Proc. 3rd Annual ACM Symp.
Principles of Distributed Computing, 1984, pp. 163-178.

24. R. van Renesse, Y. Minsky, M. Hayden, “A gossip-style failure detection service”,
Proc. Middleware ’98 (IFIP), Sept 1998, pp. 55-70.

25. D. Zuckerman, “Randomness-optimal sampling, extractors, and constructive leader
election”, Proc. 28th Annual ACM Symp. Theory of Computing, 1996, pp. 286-295.




Approximation Algorithms for Survivable 
Optical Networks 

(Extended Abstract) 



Tamar Eilam¹,², Shlomo Moran¹, Shmuel Zaks¹

¹ Department of Computer Science
The Technion, Haifa 32000, Israel
{eilam,moran,zaks}@cs.technion.ac.il

² IBM T.J. Watson Research Center
Yorktown Heights, N.Y. 10598

We are motivated by the developments in all-optical networks - a new
technology that supports high bandwidth demands. These networks provide a set of
lightpaths which can be seen as high-bandwidth pipes on which communication
is performed. Since the capacity enabled by this technology substantially exceeds
the one provided by conventional networks, its ability to recover from failures
within the optical layer is important. In this paper we study the design of a
survivable optical layer. We assume that an initial set of lightpaths (designed
according to the expected communication pattern) is given, and we aim
to augment this initial set with additional lightpaths such that the result will
guarantee survivability. For this purpose, we define and motivate a ring
partition survivability condition that the solution must satisfy. Generally speaking,
this condition states that lightpaths must be arranged in rings. The cost of the
solution found is the number of lightpaths in it. This cost function reflects the
switching cost of the entire network. We present some negative results regarding
the tractability and approximability of this problem, and an approximation
algorithm for it. We analyze the performance of the algorithm for the general case
(arbitrary topology) as well as for some special cases.

1 Introduction

1.1 Background

Optical networks play a key role in providing high bandwidth and connectivity
in today’s communication world, and are currently the preferred medium for
the transmission of data. While first generation optical networks simply served
as a transmission medium, second generation optical networks perform some
switching and routing functions in the optical domain. In these networks (also
termed all-optical) routing is performed by using lightpaths. A lightpath is an
end-to-end connection established across the optical network. Every lightpath
corresponds to a certain route in the network, and it uses a wavelength in each
link in its route. (Two lightpaths which use the same link are assigned different
wavelengths.) Routing of messages is performed on top of the set of lightpaths,
where the route of every message is a sequence of complete lightpaths. At least
in the near term the optical layer provides a static (fixed) set of lightpaths which
is set up at the time the network is deployed.

Since the capacity enabled by this technology substantially exceeds the one
provided by conventional networks, it is important to incorporate the ability
to recover from failures into the optical layer. Survivability is the ability of the
network to recover from failures of hardware components. In this paper we study
the design of a survivable optical layer. Our goal is the construction of a low-cost
survivable set of lightpaths in a given topology. We assume that an initial set of
lightpaths (designed according to the expected communication pattern) is given,
and we aim to augment this initial set with additional lightpaths
such that the resulting set will guarantee survivability. For this purpose, we
define a survivability condition that the solution must satisfy and a cost function
according to which we evaluate the cost of the solution found.

We focus on the ring partition survivability condition. Informally, this
condition states that lightpaths are partitioned into rings, and that all lightpaths in a
ring traverse disjoint routes in the underlying topology. The motivation for the
ring partition survivability condition is twofold. First, it supports a simple and
fast protection mechanism. In the case of a failure, the data is simply re-routed
around the impaired lightpath, on the alternate path of lightpaths in its ring.
The demand that all lightpaths in one ring traverse disjoint routes guarantees
that this protection mechanism is always applicable in the case of one failure.
Second, a partition of the lightpaths into rings is necessary in order to support a
higher layer in the form of SONET/SDH self-healing rings, which is anticipated
to be the most common architecture at least in the near term future ([GLS98]).

Another issue is determining the cost of the design. We assume that a
uniform cost is charged for every lightpath, namely, the cost of the design is the
number of lightpaths in it. This cost measure is justified for two reasons. First,
in regional area networks it is reasonable to assume that the same cost will be
charged for all the lightpaths ([RS98]). Second, every lightpath is terminated
by a pair of line terminals (LTs, in short). The switching cost of the entire
network is dominated by the number of LTs, which is proportional to the number
of lightpaths ([GLS98]).

We assume that the network topology is given in the form of a simple graph.
A lightpath is modeled as a pair (ID, P), where ID is a unique identifier and
P is a simple path in the graph. A design D for a set of lightpaths C is a
set of lightpaths which subsumes C (i.e., $C \subseteq D$). A design is termed ring
partition if it satisfies the ring partition condition. The cost of a design is the
number of lightpaths in it (namely, $cost(D) = |D|$). We end up with the following
optimization problem, which we term the minimum cost ring partition design
(MCRPD, in short) problem. The input is a graph G and an initial set C of
lightpaths in G. The goal is to find a ring partition design D for C with minimum
cost.

1.2 Results 

We prove that the MCRPD problem is NP-hard for every family of topologies
that contains cycles with unbounded length, e.g., rings (see formal definition in
the sequel). Moreover, we prove that there is no polynomial time approximation
algorithm A that constructs a design D which satisfies $cost(D) \le OPT + n^{\alpha}$,
for any constant $\alpha < 1$, where n is the number of lightpaths in the initial set,
and OPT is the cost of an optimal solution for this instance (unless P = NP).
For $\alpha = 1$, a trivial approximation algorithm constructs a solution within this
bound.

We present a ring partition algorithm (RPA, in short) which finds in
polynomial time a ring partition design for every given instance of MCRPD (if it
exists). We analyze the performance of RPA and show that for the general case
(arbitrary topology) RPA guarantees
$cost(D) \le \min(OPT + \beta \cdot n, 2n)$ for a constant $\beta < 1$, where n and OPT are
as defined above. We analyze the performance of RPA also for some interesting
special cases in which better results are achieved.

The paper is structured as follows. We first present the model (Section 2),
followed by a description of the MCRPD problem (Section 3). We then discuss
the results (Section 4), followed by a summary and future research directions
(Section 5). Some of the proofs in this extended abstract are only briefly sketched
or omitted.

1.3 Related Works 

The paper [GLS98] studies ring partition designs for the special case where the
physical topology is a ring. In fact, the MCRPD problem is a generalization of
this problem for arbitrary topologies. This paper also motivates the focus on the
number of lightpaths rather than the total number of wavelengths in the design.
Some heuristics to construct ring partition designs in rings are given and some
lower and upper bounds on the cost (as a function of the load) are proved. The
paper also considers lightpath splitting - a lightpath might be partitioned into two
or more lightpaths. It is shown that better results can be achieved by splitting
lightpaths.

Other works in this field refer to models different from the one we consider.
[GRS97] presents methods for recovering from channel, link and node failures in
first generation WDM ring networks with limited wavelength conversion.

Other works refer to second generation optical networks, where traffic is
carried on a set of lightpaths. The paper [RS97] assumes that lightpaths are
dynamic and focuses on management protocols for setting them up and taking
them down.

When the set of lightpaths is static, survivability is achieved by providing
disjoint routes to be used in the case of a failure. [HNS94] and [AA98] study this
problem, but the objective is the minimization of the total number of wavelengths
and not the number of lightpaths.

The paper [ACB97] offers some heuristics and empirical results for the
following problem. Given the physical topology and a set of connection requests
(i.e., requests for lightpaths in the form of pairs of nodes), find routes for the
requests so as to minimize the number of pairs (l, e) consisting of a routed request
(i.e., a lightpath) l and a physical link e, for which there is no alternative path
of lightpaths between the endpoints of l in the case that e fails. Note that this
survivability condition is less restrictive than the ring partition condition that
we consider in this paper.

2 Model and Definitions 

For our purposes, lightpaths are modeled as connections, where every connection
c has a unique identifier ID(c) and is associated with a simple path $\mathcal{R}(c)$ in the
network. $\mathcal{R}$ is termed the routing function. Note that two different connections
might have the same route. We assume that routes of connections are always
simple (i.e., they do not contain loops). We say that two connections are disjoint
if their routes are disjoint, namely, they do not share any edge, nor any node
which is not an end node of both connections. We use the terms connections and
lightpaths interchangeably.
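The disjointness condition can be checked mechanically. The sketch below is an illustration only: it assumes (this representation is not from the paper) that a route is given as the list of nodes it visits, in order, in an undirected simple graph.

```python
def are_disjoint(route_a, route_b):
    """True if two routes share no edge and no node other than a common end node.

    Each route is a list of nodes visited in order (an assumed representation).
    """
    # Undirected edges, so (u, v) and (v, u) compare equal.
    edges_a = {frozenset(e) for e in zip(route_a, route_a[1:])}
    edges_b = {frozenset(e) for e in zip(route_b, route_b[1:])}
    if edges_a & edges_b:
        return False
    ends_a = {route_a[0], route_a[-1]}
    ends_b = {route_b[0], route_b[-1]}
    shared_nodes = set(route_a) & set(route_b)
    # A shared node is allowed only if it is an end node of both connections.
    return all(v in ends_a and v in ends_b for v in shared_nodes)
```

Two routes a-b-c and c-d-a are disjoint under this test, whereas a-b-c and b-d are not, because b is an interior node of the first route.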

A virtual path P is a sequence $(v_1, c_1, v_2, c_2, \ldots, c_k, v_{k+1})$, where $c_i$ is a
connection with endpoints $v_i$ and $v_{i+1}$ (for $i = 1, \ldots, k$). P is termed a virtual
cycle if $v_1 = v_{k+1}$. We denote by S(P) the set $\{c_1, c_2, \ldots, c_k\}$ of connections
in P. The routing function $\mathcal{R}$ is naturally generalized to apply to virtual paths
(and cycles) by concatenating the corresponding paths of connections. A virtual
path (or cycle) P is termed plain if $\mathcal{R}(P)$ is a simple path (or cycle) in the
network.
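As an illustration of these definitions, the following sketch tests whether a sequence of connections forms a plain virtual cycle. The representation is an assumption made for the example: each connection is given by its route as a list of nodes, oriented so that it starts where the previous one ends.

```python
def concatenate_routes(routes):
    """Concatenate the routes of consecutive connections into one walk."""
    walk = list(routes[0])
    for route in routes[1:]:
        if route[0] != walk[-1]:
            raise ValueError("consecutive connections must share an endpoint")
        walk.extend(route[1:])  # skip the shared endpoint
    return walk


def is_plain_virtual_cycle(routes):
    """True if the concatenated routes form a simple cycle in the network."""
    walk = concatenate_routes(routes)
    if walk[0] != walk[-1]:
        return False  # not a virtual cycle at all
    interior = walk[:-1]
    # A simple cycle in a simple graph has at least 3 nodes, none repeated.
    return len(interior) >= 3 and len(set(interior)) == len(interior)
```

For example, the connections with routes a-b-c and c-d-a form a plain virtual cycle, while a-b-c followed by c-b-a forms a virtual cycle that is not plain.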

A design D for a set of connections C in a network G is a set of connections
which subsumes C (i.e., $C \subseteq D$). A ring partition design D for a set of
connections C satisfies $D = \bigcup_{t \in T} S(P_t)$, where every $P_t$, $t \in T$, is a plain virtual cycle,
and $S(P_{t_1}) \cap S(P_{t_2}) = \emptyset$ for every two distinct $t_1, t_2 \in T$. The partition $\{P_t\}_{t \in T}$ is termed the
ring partition of the design D. For a design D, $cost(D) = |D|$, i.e., the number
of lightpaths in the design.

The minimum cost ring partition design (MCRPD, in short) problem is
formally defined as follows. The input is a graph G and a set of connections C in
G. The goal is to find a ring partition design D for C that minimizes cost(D).
The corresponding decision problem is to decide, for a set of connections C in G
and a positive integer s, whether there is a ring partition design D for C such
that $cost(D) \le s$.

$MCRPD_{\mathcal{G}}$ denotes the version of the problem in which the input is restricted
to a family $\mathcal{G}$ of networks (e.g., the family $\mathcal{R}$ of rings).

Figure 1 is an example of the MCRPD problem, where (a) shows an instance
with an initial set of size 4, and (b) shows a solution which consists of 2 rings
and 3 new connections. The cost of the solution is thus 7.








Fig. 1. The MCRPD problem. 



3 The MCRPD Problem 

In this section we start our study of the MCRPD problem by providing some
negative results regarding the tractability and approximability of the problem.

We say that a family of topologies $\mathcal{G} = \{G_1, G_2, \ldots\}$ has the unbounded cycle
(UBC) property if there exists a constant k such that for every n there exists
a graph $G_{i_n} \in \mathcal{G}$, with size $O(n^k)$, that contains a cycle of length n. Examples
of families of topologies having the UBC property are the family $\mathcal{R}$ of ring
topologies and the family of complete graphs.

Theorem 1. The $MCRPD_{\mathcal{G}}$ problem is NP-hard for every family of topologies
$\mathcal{G}$ having the UBC property.

Proof. See [EMZ00].

We continue by studying approximation algorithms for the MCRPD problem.
A trivial approximation algorithm is achieved by adding, for every connection c,
a new disjoint connection between c’s endpoints. Note that if there is no such
route then there is no ring partition design for this instance. The resulting ring
partition design will include virtual cycles, each with two connections, one of
which belongs to the initial set C. For an algorithm A, we denote by A(I) the
value of a solution found by A for an instance I, and by OPT(I) the value of
an optimal solution. Clearly, $TRIV(I) = 2n \le OPT(I) + n$ for every instance
I = (G, C) of MCRPD, where |C| = n. A question which arises naturally is
whether there exists an approximation algorithm A for the MCRPD problem
that guarantees $A(I) \le OPT(I) + n^{\alpha}$ for some constant $\alpha < 1$. We give a
negative answer to this question (for every constant $\alpha < 1$).
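The bound for the trivial algorithm follows in two steps from the definitions of Section 2; the short derivation below spells out the reasoning that the word "clearly" leaves implicit. It uses only the facts that every design subsumes the initial set C and that the trivial algorithm adds exactly one new connection per initial connection.

```latex
\begin{align*}
  OPT(I)  &\ge |C| = n
          && \text{(every design $D$ satisfies $C \subseteq D$, so $cost(D) = |D| \ge |C|$)} \\
  TRIV(I) &= n + n = 2n
          && \text{(one added connection for each of the $n$ initial connections)} \\
  TRIV(I) &= n + n \le OPT(I) + n .
\end{align*}
```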

Theorem 2. Let $\mathcal{G}$ be any family of topologies having the UBC property. Then
for any constant $\alpha < 1$, $MCRPD_{\mathcal{G}}$ has no polynomial-time approximation
algorithm A that guarantees $A(I) \le OPT(I) + n^{\alpha}$ (unless P = NP).

Proof. See [EMZ00].

The next question is whether there is an approximation algorithm A for
MCRPD which guarantees $A(I) \le OPT(I) + \beta \cdot n$, where $\beta < 1$ is a constant
(clearly, the trivial algorithm TRIV satisfies this bound for $\beta = 1$). In the sequel
we answer this question positively, for a specific constant $\beta < 1$.

4 A Ring Partition Approximation Algorithm 

In this section we provide an approximation algorithm, termed ring partition
algorithm (RPA, in short), for the MCRPD problem. We analyze RPA and
show that it guarantees $RPA(I) \le \min(OPT(I) + \beta \cdot n, 2n)$, for a constant $\beta < 1$,
for every instance I (where n is the number of connections in the initial set).
We also study some special cases in which better results are achieved.

Unless stated otherwise we assume an arbitrary network topology G = (V, E),
where $V = \{v_1, \ldots, v_m\}$, and an initial set of connections C in G, where |C| = n.
We assume that the route $\mathcal{R}(c)$ of every connection c in C is a sub-path in some
simple cycle in G (observe that this assumption can be verified in polynomial
time, and without it there is no ring partition design D for C).

4.1 Preliminary Constructions 

We define some preliminary constructions that are used later for the definition
of RPA. Recall that a virtual path P is a sequence $(v_1, c_1, v_2, c_2, \ldots, c_k, v_{k+1})$,
where $c_i$ is a connection with endpoints $v_i$ and $v_{i+1}$ (for $i = 1, \ldots, k$). P is
termed a virtual cycle if $v_1 = v_{k+1}$. The pair of connections $c_i$ and $c_{i+1}$ are
termed attached at node $v_{i+1}$ in P (or simply, attached in P). If P is a virtual
cycle then the pair $c_1$ and $c_k$ are also considered attached (at node $v_{k+1}$) in P.

Let C be a set of connections in G, and let v be a node in G. We denote
by $C(v) \subseteq C$ the set of connections for which v is an endpoint. Let Q be the
symmetric binary relation over the set C of connections that is defined as follows:
$(c_1, c_2) \in Q$ iff $c_1$ and $c_2$ are disjoint and there exists a simple cycle in G which
contains both routes $\mathcal{R}(c_1)$ and $\mathcal{R}(c_2)$. Then Q defines an end-node graph $NG_v =
(NV_v, NE_v)$ for every node v, where the set of nodes $NV_v$ is $C(v)$, and $NE_v$ is the
set of edges, as follows. For every pair of connections $c_i, c_j \in C(v)$, $\{c_i, c_j\} \in NE_v$
iff $(c_i, c_j) \in Q$. A matching for a graph G = (V, E) is a set $E' \subseteq E$ such that no
two edges in $E'$ share a common endpoint. A maximum matching is a matching
of maximum size. We denote by match(G) the size of a maximum matching
for G. A matching in an end-node graph $NG_v$, for a node v, describes a set of
attachments of pairs of connections (which satisfy Q) in v.

Consider a graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_m\}$, and a set of
connections C in G. A matching-set for G and C is a set of matchings
$\mathcal{E} = \{NE'_{v_1}, NE'_{v_2}, \ldots, NE'_{v_m}\}$, where $NE'_{v_i} \subseteq NE_{v_i}$ is a matching in the
end-node graph $NG_{v_i}$ (see Figure 2 for an example).




Fig. 2. A graph, a set of connections, a matching-set (where only matchings in
non-trivial end-node graphs are shown), and the equivalent subgraph-partition. 



A subgraph-partition $Q = Q_p \cup Q_c$ for a set of connections C is a partition
of the connections in C into virtual paths and cycles (which are also termed
subgraphs) as follows. Recall that $S(g)$ is the set of connections that are included
in a virtual path (or cycle) g. $Q_p$ is a set of virtual paths, $Q_c$ is a set of virtual
cycles, $C = \bigcup_{g \in Q} S(g)$, and $S(g_1) \cap S(g_2) = \emptyset$ for every two distinct $g_1, g_2 \in Q$.
Note that the ring partition $\{P_t\}_{t \in T}$ of a ring partition design $D = \bigcup_{t \in T} S(P_t)$
is actually a subgraph-partition for D (where $Q_p = \emptyset$). In general the virtual paths
and cycles in a subgraph-partition might not be plain.

Note that there is a one-to-one correspondence between matching-sets and
subgraph-partitions, as follows. Consider a matching-set
$\mathcal{E} = \{NE'_{v_1}, \ldots, NE'_{v_m}\}$ and a subgraph-partition $Q = Q_p \cup Q_c$ for a set of
connections C in G. $\mathcal{E}$ and $Q$ are termed equivalent if the following condition
is satisfied: for every pair of connections $c_1, c_2 \in C$, there exists a subgraph
$g \in Q$ such that $c_1$ and $c_2$ are attached at node $v_i$ in g, iff $\{c_1, c_2\} \in NE'_{v_i}$.

For a matching-set $\mathcal{E}$ we denote by $Q_{\mathcal{E}}$ the (unique) equivalent
subgraph-partition. Similarly, $\mathcal{E}_Q$ is the (unique) equivalent matching-set for a
given subgraph-partition $Q$. Clearly, for a matching-set $\mathcal{E}$, $\mathcal{E}_{Q_{\mathcal{E}}} = \mathcal{E}$.
As an example see Figure 2.
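The correspondence can be illustrated with a short sketch. The Python code below is not part of the paper; it assumes a hypothetical representation in which every connection is identified by a name and a pair of distinct endpoints, and a matching-set is a dictionary from nodes to lists of matched pairs. Since every connection is attached to at most one other connection at each of its two endpoints, following attachments yields virtual paths and virtual cycles.

```python
def subgraph_partition(connections, matching_set):
    """Group connections into virtual paths and virtual cycles.

    connections:  dict, connection id -> (u, v) endpoints, with u != v (assumed format).
    matching_set: dict, node -> list of pairs (c1, c2) attached at that node.
    Returns (paths, cycles), each a list of lists of connection ids.
    """
    # partner[(c, node)] is the connection attached to c at that node, if any.
    partner = {}
    for node, pairs in matching_set.items():
        for c1, c2 in pairs:
            partner[(c1, node)] = c2
            partner[(c2, node)] = c1

    def other_end(c, node):
        u, v = connections[c]
        return v if node == u else u

    def walk(start, entry):
        """Follow attachments from `start`, entered at `entry`, until they end or loop back."""
        seq, c, node = [start], start, other_end(start, entry)
        while (c, node) in partner and partner[(c, node)] != start:
            c = partner[(c, node)]
            seq.append(c)
            node = other_end(c, node)
        return seq

    visited, paths, cycles = set(), [], []
    # A virtual path starts at a connection that has an unattached endpoint.
    for c, (u, v) in connections.items():
        if c in visited:
            continue
        free = [e for e in (u, v) if (c, e) not in partner]
        if free:
            seq = walk(c, free[0])
            visited.update(seq)
            paths.append(seq)
    # Every remaining connection is attached at both endpoints, so it lies on a cycle.
    for c, (u, v) in connections.items():
        if c not in visited:
            seq = walk(c, u)
            visited.update(seq)
            cycles.append(seq)
    return paths, cycles
```

On any input of this form the number of returned virtual paths equals the number of connections minus the total number of matched pairs, which is the counting identity stated in Observation 1.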
4.2 Ring Partition Algorithm (RPA)

We present a ring partition algorithm, called RPA, which finds a ring partition
design for a set of connections C in G in four main stages. First, the end-node
graph $NG_{v_i}$ is constructed and a maximum matching in it is found for every node
$v_i$, $i = 1, \ldots, m$. This defines a maximum matching-set $\mathcal{E}$. Then, the equivalent
subgraph-partition $Q_{\mathcal{E}} = Q_p \cup Q_c$ is constructed. Next, we partition every non-plain
virtual path or virtual cycle in $Q_{\mathcal{E}}$ into plain virtual paths. In addition, we make
sure that for every virtual path $P \in Q_p$ there is a simple cycle in G in which
$\mathcal{R}(P)$ is a sub-path. Last, the subgraph-partition is completed to a ring partition,
by adding for every virtual path $P \in Q_p$ a connection which completes it to a
plain virtual cycle. Following is the description of RPA, followed by an informal
description of the operations taken by its main functions.

RPA(G, C)
    $(Q_p, Q_c)$ := ConstructPartition(G, C)
    $(Q_p, Q_c)$ := AdjustPartition($Q_p, Q_c$, G)
    D := C $\cup$ CompletePartition($Q_p, Q_c$, G)
    return D

ConstructPartition(G, C)
    for every $i \in \{1, \ldots, m\}$
        construct $NG_{v_i} = (NV_{v_i}, NE_{v_i})$
        find a maximum matching $NE'_{v_i} \subseteq NE_{v_i}$
    $\mathcal{E} := \{NE'_{v_1}, NE'_{v_2}, \ldots, NE'_{v_m}\}$
    construct the equivalent subgraph-partition $Q_{\mathcal{E}} = (Q_p, Q_c)$
    return $(Q_p, Q_c)$

AdjustPartition($Q_p, Q_c$, G)
    for every $P \in Q_p \cup Q_c$
        $Q_c := Q_c \setminus \{P\}$      /* in case P is a cycle */
        $G_P$ := Partition(P)
        $Q_p := (Q_p \setminus \{P\}) \cup G_P$
    for every $P \in Q_p$
        if (cycle(P)) then
            $Q_p := Q_p \setminus \{P\}$
            $Q_c := Q_c \cup \{P\}$
    return $(Q_p, Q_c)$

CompletePartition($Q_p, Q_c$, G)
    $D' := \emptyset$
    for every $P \in Q_p$
        $P'$ := findDisjoint(P)
        $D' := D' \cup \{P'\}$
    return $D'$

Partition(P)
    Assume that $P = (v_1, c_1, v_2, c_2, \ldots, v_l, c_l, v_{l+1})$
    $G_P := \emptyset$; first := 1
    for i := 1 to l
        $P' := (v_{first}, c_{first}, \ldots, c_i, v_{i+1})$
        if ($\neg$(plain($P'$) $\wedge$ cycleExists($P'$))) then
            $G_P := G_P \cup \{(v_{first}, c_{first}, \ldots, c_{i-1}, v_i)\}$
            first := i
    return $G_P \cup \{(v_{first}, c_{first}, \ldots, c_l, v_{l+1})\}$



The function ConstructPartition first constructs the end-node graphs. The
algorithm to construct the end-node graphs is straightforward and is not elaborated. It
consists of determining, for every pair of connections with a common endpoint, whether
they are disjoint, and whether the path that is formed by concatenating them can be
completed to a simple cycle in G. This could be done using standard BFS techniques
(see, e.g., [Eve79]). ConstructPartition then finds maximum matchings in the
end-node graphs. Efficient algorithms for finding maximum matchings in graphs can be
found in, e.g., [MV80] (for a survey see [vL90], pages 580-588). Last, the construction
of the equivalent subgraph-partition is straightforward.

The function AdjustPartition partitions every virtual path and virtual cycle in
the subgraph-partition using the function Partition. After the partition, every virtual
path is plain and can be completed to a simple cycle in G. Every virtual path is then
checked, and if it is actually a cycle (i.e., its endpoints are equal) then it is inserted into
$Q_c$.
The task of Partition is to partition a virtual path (or cycle) into a set $\{P_1, \ldots, P_l\}$
of plain virtual paths, s.t. for every $P_i$, $\mathcal{R}(P_i)$ is a sub-path in some simple cycle in G.
The function cycleExists(P) returns true if there is a disjoint path in G between P's
endpoints. The function cycle(P) returns true if the endpoints of a given virtual path
are equal.

Last, the function CompletePartition completes every virtual path in $Q_p$ to a
virtual cycle by adding a new disjoint connection $P'$ between P's endpoints.
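The greedy cutting rule of Partition can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the predicate `acceptable` is a hypothetical stand-in for the combined test plain(P') and cycleExists(P'), and a virtual path is represented simply as the list of its connections in order.

```python
def partition(path, acceptable):
    """Greedily cut a virtual path into consecutive segments.

    path:       non-empty list of connections in path order (c_1, ..., c_l).
    acceptable: caller-supplied predicate on a list of consecutive connections;
                it plays the role of plain(P') and cycleExists(P').
    """
    segments = []
    first = 0
    for i in range(len(path)):
        prefix = path[first:i + 1]
        if not acceptable(prefix):
            if i == first:
                # The paper assumes every single connection can be completed
                # to a simple cycle, so the first connection alone must pass.
                raise ValueError("a single connection must be acceptable")
            segments.append(path[first:i])   # close the segment that ends before c_i
            first = i                        # c_i opens the next segment
    segments.append(path[first:])            # the last segment always ends with c_l
    return segments
```

For instance, with connections modelled only by their lengths and a segment being acceptable while its total length stays below a fixed bound, `partition([4, 4, 4, 4, 4], lambda seg: sum(seg) <= 9)` cuts the path after every second connection.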



4.3 Correctness and Analysis 

We first present four observations that are used for the proof of the main theorem
(Theorem 3). Observation 1 shows a connection between the sizes of matching-sets
and the equivalent subgraph-partitions.

Observation 1 Let $\mathcal{E} = \{NE'_{v_1}, \ldots, NE'_{v_m}\}$ be a matching-set for a set of
connections C in $G = (V, E)$, where $|C| = n$, and $V = \{v_1, \ldots, v_m\}$. Let $Q_{\mathcal{E}} = Q_p \cup Q_c$
be the equivalent subgraph-partition. Then $|Q_p| = n - \sum_{i=1}^{m} |NE'_{v_i}|$.

Proof. Let an attachment point in $Q_{\mathcal{E}}$ be an ordered pair $(\{c_1, c_2\}, v)$, where the
connections $c_1$ and $c_2$ are attached at node v in some subgraph $g \in Q_{\mathcal{E}}$. Clearly the number
of unique attachment points in a virtual path $P_x \in Q_p$ is one less than the number of
connections in $P_x$, i.e., $|S(P_x)| - 1$. The number of unique attachment points is equal
to $|S(P_x)|$ if $P_x \in Q_c$ is a virtual cycle. It follows that the total number of unique
attachment points is equal to $\left(\sum_{g \in Q_{\mathcal{E}}} |S(g)|\right) - |Q_p| = n - |Q_p|$. Now by the
definitions there is a one-to-one correspondence between attachment points and edges in
the matchings. It follows that the number of attachment points is equal to the number of
edges in the matching-set, i.e., $n - |Q_p| = \sum_{i=1}^{m} |NE'_{v_i}|$.
Let $Q(D)$ be a subgraph-partition for a set of connections D. The projection $Q(D)|_C$
of $Q(D)$ on a set of connections $C \subseteq D$ is a subgraph-partition for C which is obtained
from $Q(D)$ by deleting all the connections that are not in C (i.e., all the connections
in $D \setminus C$). Note that a virtual path (or cycle) in $Q(D)$ might be cut by this process
into a few virtual paths. Similarly, let $\mathcal{E}(D)$ be a matching-set for D. Then the projection
$\mathcal{E}(D)|_C$ of $\mathcal{E}(D)$ on a set of connections $C \subseteq D$ is a matching-set for C which is
obtained from $\mathcal{E}(D)$ by deleting from the end-node graphs (and the matchings) the nodes
which correspond to connections in $D \setminus C$ and the edges that meet them. Clearly, if
$Q(D)$ and $\mathcal{E}(D)$ are equivalent then so are $Q(D)|_C$ and $\mathcal{E}(D)|_C$.

Consider a ring partition design $D = \bigcup_{t \in T} S(P_t)$ for a set of connections C. We
denote by $Q(D)$ the ring partition $\{P_t\}_{t \in T}$ of D, and by $\mathcal{E}(D)$ the equivalent
matching-set for D (i.e., $\mathcal{E}(D) = \mathcal{E}_{Q(D)}$). The subgraph-partition $Q(D)|_C$ and the
matching-set $\mathcal{E}(D)|_C$ for the initial set of connections C are termed the induced
subgraph-partition and the induced matching-set, respectively (note that they are
equivalent). Observation 2 associates the cost of ring partition designs with the sizes
of the induced matching-sets and subgraph-partitions.

Observation 2 Let $D = \bigcup_{t \in T} S(P_t)$ be a ring partition design for a set of connections
C in a physical topology $G = (V, E)$, where $|C| = n$, and $|V| = m$. Let
$\mathcal{E}(D)|_C = \{NE'_{v_1}, NE'_{v_2}, \ldots, NE'_{v_m}\}$ and $Q(D)|_C = Q_p \cup Q_c$ be the induced matching-set and
subgraph-partition for C. Then $cost(D) \ge n + |Q_p| = 2n - \sum_{i=1}^{m} |NE'_{v_i}|$.

Proof. By the definitions, $cost(D) = \sum_{t \in T} |S(P_t)|$. Let $new(P_t)$ be the number of new
connections in the virtual cycle $P_t$, i.e., $new(P_t) = |S(P_t) \cap (D \setminus C)|$. Clearly,
$cost(D) = n + \sum_{t \in T} new(P_t)$. Consider now the induced subgraph-partition
$Q(D)|_C = Q_p \cup Q_c$. Recall that it is obtained from D by deleting all the new connections.
In this process a virtual cycle in the ring partition might be cut into a few virtual paths.
Clearly the number of such virtual paths for each virtual cycle is at most the number of
new connections in it. It follows that $|Q_p| \le \sum_{t \in T} new(P_t)$, thus $cost(D) \ge n + |Q_p|$.
By Observation 1, $n + |Q_p| = 2n - \sum_{i=1}^{m} |NE'_{v_i}|$. Note that strict inequality occurs when
two new connections are attached in one of the virtual cycles.

A maximum matching-set is a matching-set $\mathcal{E} = \{NE'_{v_1}, \ldots, NE'_{v_m}\}$ for a set of
connections C, s.t. the matching $NE'_{v_i}$ is a maximum matching for the end-node graph
$NG_{v_i}$, for every $i = 1, \ldots, m$. Recall that $match(G)$ is the size of a maximum matching
for G. Observation 3 is a lower bound on the value of an optimal solution.

Observation 3 Every ring partition design D for C in G satisfies
$cost(D) \ge 2n - \sum_{i=1}^{m} match(NG_{v_i})$ (where n and m are defined as above).

Proof. Let $D = \bigcup_{t \in T} S(P_t)$ be a ring partition design for C. Note that every two
connections that are attached in a virtual cycle $P_t$, $t \in T$, in the design satisfy the
relation Q, i.e., they are disjoint and there is a simple cycle that contains both routes.
Clearly, the same holds also for the induced subgraph-partition $Q(D)|_C$ and matching-set
(since we only delete connections). Consider the equivalent matching-set
$\mathcal{E}(D)|_C = \{NE'_{v_1}, NE'_{v_2}, \ldots, NE'_{v_m}\}$. It follows that $NE'_{v_i}$ is actually a matching in the
end-node graph $NG_{v_i}$, for $i = 1, \ldots, m$, and thus $|NE'_{v_i}| \le match(NG_{v_i})$. It follows, by
Observation 2, that $cost(D) \ge 2n - \sum_{i=1}^{m} match(NG_{v_i})$.

Consider a ring partition design $D = \bigcup_{t \in T} S(P_t)$ for a set of connections C in G. Let
$new(P_t)$ be the number of new connections in $S(P_t)$ (i.e., connections in $S(P_t) \cap (D \setminus C)$).
A canonical ring partition design satisfies $new(P_t) \le 1$ for every $t \in T$. Note that
it is always possible to construct from a given ring partition design D a canonical ring
partition design $D'$ such that $cost(D') \le cost(D)$, as follows. Let $Q(D)|_C = Q_p \cup Q_c$
be the induced subgraph-partition of D. To construct a canonical ring partition design
$D'$ with at most the same cost we complete every virtual path in $Q_p$ to a plain virtual
cycle by adding one new connection. (This is always doable since every virtual path
in $Q_p$ is plain and is included in some simple cycle in G.) By the discussion above,
$cost(D') = n + |Q_p| \le cost(D)$. Observation 4 follows.

Observation 4 If there is a ring partition design for a set of connections C in G then 
there is a canonical ring partition design with minimum cost. 

It can be proved that Observation 2 holds for canonical ring partition designs $D'$
with equality, i.e., $cost(D') = n + |Q_p|$. It is therefore sometimes convenient to consider
for simplicity only canonical ring partition designs.

We are now ready to prove the main theorem. 

Theorem 3. $RPA(I) \le \min(OPT(I) + \frac{3}{5} n, 2n)$, for every $I = (G, C)$, where $|C| = n$.

Sketch of Proof: For the analysis we denote by $Q_p^1$ and $Q_c^1$ the sets $Q_p$ and $Q_c$ right
after the execution of ConstructPartition, and by $Q_p^2$ and $Q_c^2$ the corresponding sets
right after the execution of AdjustPartition.

We now examine the partition procedure Partition. Recall that the end-node
graphs are constructed w.r.t. the relation Q, which holds for a pair of connections
$c_1$ and $c_2$ iff their routes $\mathcal{R}(c_1)$ and $\mathcal{R}(c_2)$ are disjoint and there is a simple
cycle which contains both routes (as sub-paths). Consider a virtual path
$P = (v_1, c_1, v_2, c_2, \ldots, v_l, c_l, v_{l+1}) \in Q_p^1$. Since P is a virtual path in the equivalent
subgraph-partition $Q_{\mathcal{E}}$, it holds that $(c_i, c_{i+1}) \in Q$, for every $i = 1, \ldots, l - 1$. Let $G_P$ be
the set of virtual paths which is the output of Partition(P). By the above discussion, and
by the definition of Partition, at most one virtual path in $G_P$ contains fewer than two
connections. Such a virtual path can only be the last one, which contains the connection
$c_l$. Let $n_P = |S(P)|$ (i.e., the number of connections in the virtual path P). Let
$m_P = |G_P|$ (i.e., the number of plain virtual paths that are the result of applying the
partition procedure on P). It follows that $m_P \le \lfloor \frac{n_P + 1}{2} \rfloor$.

Now consider a non-plain virtual cycle $P \in Q_c^1$. Then, by the same considerations,
$m_P \le \lfloor \frac{n_P + 1}{2} \rfloor$, where $n_P$ and $m_P$ are defined similarly.

Let $Q_c^{odd} \subseteq Q_c^1$ and $Q_c^{even} \subseteq Q_c^1$ be the sets of non-plain virtual cycles with,
respectively, an odd and an even number of connections, after ConstructPartition. Note that
CompletePartition adds one new connection for every virtual path $P \in Q_p^2$. We get,

$RPA(I) = |Q_p^2| + n$
$\le \sum_{P \in Q_p^1} \left(\frac{n_P}{2} + \frac{1}{2}\right) + \sum_{P \in Q_c^{even}} \frac{n_P}{2} + \sum_{P \in Q_c^{odd}} \left(\frac{n_P}{2} + \frac{1}{2}\right) + n$
$\le \frac{3}{2} n + \frac{1}{2} |Q_p^1| + \frac{1}{2} |Q_c^{odd}|$

Observe that a non-plain virtual cycle in $Q_c^1$ contains at least 4 connections, since
otherwise clearly there are two consecutive connections that are not disjoint in the
cycle, which is not possible by the definition of the algorithm. A cycle in $Q_c^{odd}$
therefore contains at least 5 connections, and it follows that $|Q_c^{odd}| \le \lfloor \frac{n}{5} \rfloor$.
We get $RPA(I) \le n + \frac{3}{5} n + \frac{1}{2} |Q_p^1|$. Now, by Observation 3, we can show that
$OPT(I) \ge n + |Q_p^1|$ (since in the first step RPA finds maximum matchings in the
end-node graphs). Thus, $RPA(I) \le OPT(I) + \frac{3}{5} n$.

Observe that RPA constructs a canonical solution, i.e., there is at most one new
connection in every ring. Clearly, there is at least one connection from the initial set
in every ring. It follows that $RPA(I) \le 2n$. ■
Note that since $OPT(I) \ge n$, this is actually better than a $\frac{8}{5}$-approximation.

Time complexity. The time complexity depends on the exact format of the input for
the algorithm and the data structures which are used in order to represent the physical
topology, the set of connections and the auxiliary combinatorial constructions (i.e., the
end-node graphs and the subgraph-partition). It is clear, however, that this time is
polynomial in the size of C and G. It is well known that it takes $O(\sqrt{|V|} \cdot |E|)$ time to
find a maximum matching in a graph $G = (V, E)$ ([MV80]) and that it takes $O(|E|)$
time to find whether two paths are disjoint, or whether there exists a disjoint path
between a given path's endpoints. For special topologies these tasks can be significantly
simpler. For instance, in the ring physical topology case every plain virtual
path can be completed to a plain virtual cycle, thus the relation Q can be simplified to
$Q(c_1, c_2) = disjoint(c_1, c_2)$. The end-node graphs are bipartite, and finding maximum
matchings in bipartite graphs is considerably easier ([vL90]). Also, finding a disjoint
path between the endpoints of a given simple path is trivial. In any case, for the
applications of RPA to the design of optical networks, time efficiency is not crucial, since
the algorithm is applied only in the design stage of the network and it is reasonable to
invest some preprocessing time once in order to achieve better network designs.
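The ring case can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the construction used in the paper: ring nodes are numbered 0, ..., m-1, ring edge i joins nodes i and (i+1) mod m, every connection is a distinct pair (s, t) routed clockwise from s to t, and "disjoint" is taken to mean edge-disjoint. The matching is found with a simple augmenting-path search (Kuhn's algorithm), chosen here for brevity rather than speed.

```python
def edges_of(conn, m):
    """Ring edges used by the clockwise route of connection (s, t)."""
    s, t = conn
    return {(s + j) % m for j in range((t - s) % m)}


def end_node_matching(v, connections, m):
    """Maximum matching in the end-node graph of ring node v."""
    arriving = [c for c in connections if c[1] == v]   # routes that end at v
    leaving = [c for c in connections if c[0] == v]    # routes that start at v
    # Two routes on the same side both use the ring edge next to v, so the
    # end-node graph only joins an arriving route to a leaving one (bipartite).
    adj = {a: [b for b in leaving if not (edges_of(a, m) & edges_of(b, m))]
           for a in arriving}
    match_of = {}   # leaving connection -> arriving connection matched to it

    def augment(a, seen):
        """Try to match `a`, re-matching earlier choices along an augmenting path."""
        for b in adj[a]:
            if b in seen:
                continue
            seen.add(b)
            if b not in match_of or augment(match_of[b], seen):
                match_of[b] = a
                return True
        return False

    for a in arriving:
        augment(a, set())
    return [(a, b) for b, a in match_of.items()]
```

Running `end_node_matching` for every node of the ring gives a maximum matching-set for this simplified model; the pairs returned for node v are the attachments at v.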

4.4 Special Cases 

4.5 Optimal Cases 

Since the MCRPD problem is NP-hard (Theorem 1) it is natural to try and find re- 
stricted families of topologies for which it can be solved in polynomial time. Unfor- 
tunately, we actually proved in Theorem 1 that the MCRPD problem is NP-hard for 
every family of topologies that contains cycles with unbounded length (e.g., rings). 
Since trees do not support ring partition designs, this implies that the problem is NP- 
hard for every family of topologies which is of interest in this setting. This observation 
motivates the question of finding polynomially solvable classes of instances of the prob- 
lem when taking into account not only the topology of the network but also the initial 
set of connections. 

The induced graph $IG_C = (IV_C, IE_C)$ for a set of connections C in G is the
subgraph of G which includes all the edges and nodes of G that are used by at least one
connection in C.

A natural question is whether applying restrictions on the induced graph suffices to 
guarantee efficient optimal solution to the problem. We answer this question negatively 
by showing that the problem remains NP-hard even for the most simple case where 
the induced graph is a chain. 

Theorem 4. The MCRPD problem is NP-hard even if the induced graph for the set of
connections C in G is a chain (or a set of chains).

Next we show that if, in addition to an induced graph with no cycles, the network 
topology satisfies a certain condition (w.r.t. the initial set of connections), then RPA 
finds a minimum cost ring partition design. 

Theorem 5. $RPA(I) = OPT(I)$ for every instance $I = (G, C)$ which satisfies the
following two properties.

No Cycles. The induced graph $IG_C = (IV_C, IE_C)$ is a forest.
Completion. For every plain virtual path P over C, there is a simple cycle in G that
contains the route of P, $\mathcal{R}(P)$, as a sub-path.

We discuss below some cases in which the conditions in Theorem 5 are satisfied.
A perfectly-connected graph (PC, in short) satisfies that every simple path in it is
included in a simple cycle. Clearly, if a graph is perfectly connected then the completion
property is satisfied for every initial set of connections. This property also guarantees
that there is a ring partition design D for every initial set of connections C. A natural
question is to characterize perfectly connected graphs. We give a full characterization of
perfectly connected graphs by proving that a graph is PC iff it is randomly Hamiltonian.
Randomly Hamiltonian graphs are defined and characterized in [CK68].

Theorem 6. A graph G is perfectly connected iff it is one of the following: a ring, a
complete graph, or a complete bipartite graph with an equal number of nodes in both sets.

We note that RPA does not have to be modified in order to give an optimal result for
instances which satisfy the conditions in Theorem 5. However, we can benefit from
recognizing such instances in advance, since in these cases the procedure AdjustPartition
can be skipped. The recognition can be done easily for specific topologies (e.g., rings),
and in polynomial time in the general case.

4.6 Bounded Length Connections in Rings 

We analyze the performance of RPA in the case of a ring physical topology, when there 
is a bound on the length of connections in the initial set. 

Theorem 7. RPA{I) < min(OPT(/) -h ^ • n,2n), for every instance I = {Rm,G) 
of $MCRPD_{\mathcal{R}}$, if for every connection $c \in C$, $length(\mathcal{R}(c)) \le k$, for any constant k,
$1 \le k \le m - 1$.

Note that RPA does not guarantee that the same bound on the length holds also for 
connections in the ring partition design which is constructed. Indeed, the case where 
the length of connections in the solution must be bounded is inherently different, and 
the main results in this paper do not hold for it. 



4.7 Approximations Based on the Load 

Let the load $l_e$ of an edge $e \in E$ be the number of connections in C which use e, and
$L = \max_{e \in E} l_e$. Recall the definition of an induced graph $IG_C = (IV_C, IE_C)$ for a
set of connections C in G (Section 4.5). We add to this definition a weight function
w on $IE_C$ that assigns every edge a weight equal to its load. Although
in the worst case the load of an instance is equal to the number of connections $|C|$,
usually it is substantially smaller. Therefore, it is interesting to bound the cost of a
design as a function of the load.

For this purpose, we assume that the route of every virtual path is a sub-path of
some simple cycle in G (i.e., the completion property). Let $W = \sum_{e \in IE_C} l_e$. Now consider
the weighted induced graph $IG_C = (IV_C, IE_C, w)$ for C. Let $T_{max}$ be a maximum-weight
spanning tree in $IG_C$, $W_{T_{max}} = \sum_{e \in T_{max}} l_e$, and $W_{G - T_{max}} = W - W_{T_{max}}$.

Following is a description of a modified version of RPA, termed $RPA_l$. We temporarily
remove all connections that use edges that are not in $T_{max}$. Next, we find a ring
partition design for the remaining set of connections (using RPA). Last, we insert back
the removed connections and complete each one of them to a virtual cycle by adding a
new connection. We prove that the cost of the resulting ring partition design is larger
by at most $2 W_{G - T_{max}}$ than the optimal one. (Note that an improved heuristic might
be to repeat the same process with the remaining set of connections.)
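The preprocessing step of this modified algorithm can be sketched as follows. The Python code is an illustrative sketch under assumptions that are not in the paper: every connection is given by its route as a list of comparable node labels, and a maximum-weight spanning forest is built with Kruskal's algorithm (a forest rather than a tree, since the induced graph need not be connected). The function name and return format are hypothetical.

```python
def split_by_max_spanning_forest(routes):
    """Return (kept, removed, w_outside) for a list of routes (node sequences).

    kept:      routes that use only edges of the maximum-weight spanning forest.
    removed:   routes that use at least one edge outside the forest.
    w_outside: total load of the edges outside the forest.
    """
    def norm(u, v):
        return (u, v) if u <= v else (v, u)

    # Load of an edge = number of connections whose route uses it.
    load = {}
    for route in routes:
        for u, v in zip(route, route[1:]):
            e = norm(u, v)
            load[e] = load.get(e, 0) + 1

    # Union-find with path halving, used by Kruskal's algorithm below.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Scanning edges by decreasing load yields a maximum-weight spanning forest.
    forest = set()
    for e in sorted(load, key=lambda edge: -load[edge]):
        ru, rv = find(e[0]), find(e[1])
        if ru != rv:
            parent[ru] = rv
            forest.add(e)

    w_outside = sum(l for e, l in load.items() if e not in forest)
    kept, removed = [], []
    for route in routes:
        in_forest = all(norm(u, v) in forest for u, v in zip(route, route[1:]))
        (kept if in_forest else removed).append(route)
    return kept, removed, w_outside
```

The routes in `kept` induce a graph with no cycles, so they can be handed to RPA, while each route in `removed` would be completed to a virtual cycle separately. The number of removed routes never exceeds `w_outside`, because every removed route contributes to the load of at least one edge outside the forest.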

Theorem 8. $RPA_l(I) \le OPT(I) + 2 W_{G - T_{max}}$, for every instance $I = (G, C)$ which
satisfies the completion property.

For the case of a ring physical topology, it holds that $RPA_l(I) \le OPT(I) + \min_{e \in E} l_e$.
A slightly better bound is given for this case in [GLS98].

Note that there might be a set of connections $C'$ with size smaller than $W_{G - T_{max}}$
such that the induced graph for the remaining set $C \setminus C'$ is a forest. However, we
prove in Proposition 9 that finding a minimum set of connections whose removal leaves
us with an induced graph with no cycles is NP-hard.

Proposition 9. Finding a minimum set of connections $C' \subseteq C$ in a graph G such that
the induced graph for the remaining set $C \setminus C'$ does not contain cycles is NP-hard.



5 Summary and Future Research 

In this paper we studied the MCRPD problem for which the input is an initial set
of lightpaths in a network and the goal is to augment this set by adding lightpaths
such that the result is a ring partition design with minimum cost. We have shown an
approximation algorithm for this problem that guarantees $cost(D) \le \min(OPT + k \cdot n, 2n)$,
where $k = \frac{3}{5}$, n is the number of lightpaths in the initial set, and OPT is the
cost of an optimal solution. Moreover, we have shown that, unless P = NP, there is no
approximation algorithm A for this problem that guarantees $cost(D) \le OPT + n^{\alpha}$,
for every constant $\alpha < 1$. The main open question here is whether the constant k can
be improved.

Ring partition designs are necessary for the near-term future of optical networks
since they support a SONET higher layer network which is configured in the form of
rings. However, it is claimed that the core network architecture will have to change
and that SONET will give way to a smart optical layer. By incorporating new
technologies it might be possible to re-route lightpaths dynamically. In these cases other,
less restrictive survivability conditions might be considered. While less restrictive
survivability conditions might be less expensive to implement, the price to pay is a more
complex protection mechanism that is executed for every failure. The challenge here
is twofold. First, to study the gain in the cost of the network when less restrictive
survivability conditions are considered. Second, to study the algorithmic and
technological issues of implementing protection mechanisms in the optical domain based on
these conditions.
Abstract. This paper presents a study of a distributed cooperation
problem under the assumption that processors may not be able to
communicate for a prolonged time. The problem for n processors is defined
in terms of t tasks that need to be performed efficiently and that are
known to all processors. The results of this study characterize the ability
of the processors to schedule their work so that when some processors
establish communication, the wasted (redundant) work these processors
have collectively performed prior to that time is controlled. The lower
bound for wasted work presented here shows that for any set of schedules
there are two processors such that when they complete $t_1$ and $t_2$ tasks
respectively the number of redundant tasks is $\Omega(t_1 t_2 / t)$. For n = t and for
schedules longer than $\sqrt{n}$, the number of redundant tasks for two or more
processors must be at least 2. The upper bound on pairwise waste for
schedules of length $\sqrt{n}$ is shown to be 1. Our efficient deterministic
schedule construction is motivated by design theory. To obtain linear length
schedules, a novel deterministic and efficient construction is given. This
construction has the property that pairwise wasted work increases
gracefully as processors progress through their schedules. Finally our analysis
of a random scheduling solution shows that with high probability
pairwise waste is well behaved at all times: specifically, two processors having
completed $t_1$ and $t_2$ tasks, respectively, are guaranteed to have no more
than $t_1 t_2 / t + \Delta$ redundant tasks, where $\Delta = O(\log n + \sqrt{t_1 t_2 / t} \sqrt{\log n})$.



1 Introduction 

The problem of cooperatively performing a set of tasks in a decentralized
setting where the computing medium is subject to failures is a fundamental
problem in distributed computing. Variations on this problem have been studied
in message-passing models [3,5,7], using group communications [6,9], and in
shared-memory computing using deterministic [12] and randomized [2,13,16]
models.
We consider the abstract problem of performing t tasks in a distributed
environment consisting of n processors. We refer to this as the DO-ALL problem.
The problem has simple and efficient solutions in synchronous fault-free
systems; however, when failures and delays are introduced the problem becomes
very challenging. Dwork, Halpern and Waarts [7] consider the DO-ALL problem
in message-passing systems and use a work measure W defined as the number of
tasks executed, counting multiplicities, to assess the computational efficiency. A
more conservative measure [5] includes any additional steps taken by the
processors, for example steps taken for coordination and waiting for messages.
Communication efficiency M is gauged using the message complexity, accounting for all
messages sent during the computation. It is not difficult to formulate solutions
for DO-ALL in which each processor performs each of the t tasks. Such solutions
have $W = \Theta(t \cdot n)$, and they do not require any communication, i.e., M = 0.
Another extreme is the synchronous model with fail-stop processors, where each
processor can send 0-delay messages to inform its peers of the computation
progress. In this case one can show that $W = O(t + n \log n / \log\log n)$. This work
is efficient (there is a matching lower bound, cf. [12]), and the upper bound does
not depend on the number of failures. However, the number of messages is more
than quadratic, and can be $\Omega(n^2 \log n / \log\log n)$ [3]. Thus satisfactory solutions
for DO-ALL must incorporate a trade-off between communication and computation.

In failure- and delay-prone settings it is difficult to precisely control the
trade-off between communication and computation. In some cases [7] it is meaningful
to attempt to optimize the overall effort, defined as the sum of work and
message complexities; in other cases [5] an attempt is made to optimize efficiency
in a lexicographic fashion by first optimizing work, and then communication.
For problems where the quality of distributed decision-making depends on
communication and can be traded off for communication, the solution space needs
to consider the possibility of no communication. Notably, this is the case in
the load-balancing setting introduced by Papadimitriou and Yannakakis [18] and
studied by Georgiades, Mavronicolas and Spirakis [8]. In this work we study the
ability of n processors to perform efficient scheduling of t tasks (initially known
to all processors) during prolonged periods of absence of communication.

This setting is interesting for several reasons. If the communication links are 
subject to failures, then each processor must be ready to execute all of the t tasks, 
whether or not it is able to communicate. In realistic settings the processors 
may not initially be aware of the network configuration, which would require 
expenditure of computation resources to establish communication, for example 
in radio networks. In distributed environments involving autonomous agents, 
processors may choose not to communicate either because they need to conserve 
power or because they must maintain radio silence. Regardless of the reasons, 
it is important to direct any available computation resources to performing the 
required tasks as soon as possible. In all such scenarios, the t tasks have to be 
scheduled for execution by all processors. The goal of such scheduling must be to 
control redundant task executions in the absence of communication and during 
the period of time when the communication channels are being (re)established.
For a variation of DO-ALL, Dolev et al. [6] showed that for the case of dynamic
changes in connectivity, the termination time of any on-line task assignment
algorithm can be greater than the termination time of an off-line task assignment
algorithm by a factor linear in n. This means that an on-line algorithm may not
be able to do better than the trivial solution that incurs linear overhead by
having each processor perform all the tasks. With this observation, [6] develops
an effective strategy for managing the task execution redundancy and proves that
the strategy provides each of the n processors with a schedule of tasks
such that at most one task is performed redundantly by any two processors.

In this work we advance the state of the art with the ultimate goal of
developing a general scheduling theory that helps eliminate redundant task executions
in scenarios where there are long periods of time during which processors work
in isolation. We require that all tasks are performed even in the absence of
communication. A processor may learn about task executions either by executing a
task itself or by learning that the task was executed by some other processor.
Since we assume initial lack of communication and the possibility that a
processor may never be able to communicate, each processor must know the set of
tasks to perform. We seek solutions where the isolated processors can execute
tasks independently such that when any two processors are able to communicate,
the number of tasks they have both executed is as small as possible. We model
solutions to the problem as sets of n lists of distinct tasks from $\{1, \ldots, t\}$. We
call such lists schedules.

Consider an example with two processors ($n = 2$). Let the schedule of the
first processor be $(1, 2, 3, \ldots, t)$, and the schedule of the second processor be
$(t, t-1, t-2, \ldots, 1)$. In the absence of communication each processor works
without the knowledge of what the other is doing. If the processors are able
to communicate after they have completed $t_1$ and $t_2$ tasks respectively and if
$t_1 + t_2 \le t$ then no work is wasted (no task is executed twice). If $t_1 + t_2 > t$,
then the redundant work is $t_1 + t_2 - t$. In fact this is a lower bound on waste
for any set of schedules. If some two processors have individually performed all
tasks, then the wasted work is $t$.
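The two-processor example can be checked numerically. The following is a minimal sketch, assuming tasks are numbered 1 to t and that the first processor works through (1, ..., t) while the second works through (t, ..., 1); the function name `pairwise_waste` is chosen here for illustration only.

```python
def pairwise_waste(t, t1, t2):
    """Number of tasks executed by both processors when processor 1 has done
    the first t1 tasks of (1, ..., t) and processor 2 the first t2 tasks of
    (t, t-1, ..., 1)."""
    first = set(range(1, t1 + 1))          # tasks 1, ..., t1
    second = set(range(t, t - t2, -1))     # tasks t, t-1, ..., t-t2+1
    return len(first & second)


def closed_form(t, t1, t2):
    """The waste stated in the text: zero until t1 + t2 exceeds t."""
    return max(0, t1 + t2 - t)
```

For every choice of prefix lengths the counted overlap agrees with the closed form max(0, t1 + t2 - t) from the example.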

Contributions. This paper presents new results that identify limits on bounded- 
redundancy scheduling of t tasks on n processors during the absence of commu- 
nication, and gives efficient and effective constructions of bounded-redundancy 
schedules using deterministic and randomized techniques. 

Lower Bounds. In Section 3 we show that for any n schedules for t tasks
the worst case pairwise redundancy when one processor performs $t_1$ and another
$t_2$ tasks is $\Omega(t_1 t_2 / t)$, i.e., the pairwise wasted work grows quadratically with the
schedule length; see Figure 1(a). We also show that for $n = t$ and for schedules
with length exceeding $\sqrt{n}$, the number of redundant tasks for two (or more)
processors must be at least two.

When $t \gg n$, scheduling is relatively easy initially by assigning chunks of
$t/n$ tasks to each processor. Our deterministic construction focuses on the most
challenging case when $t = n$.
Fig. 1. Pairwise waste (redundancy) as a function of advancement through schedules
for n = t: (a) lower bound, (b) deterministic construction, (c) randomized construction,
(d) diagonal vertical cut.



Deterministic Construction of Short Schedules. We show in Section 4
that it is in fact possible to construct schedules of length $\Theta(\sqrt{n})$ such that
exactly one redundant task is performed for any pair of processors. This result
exhibits a connection between design theory [10,4] and the distributed problem
we consider. Our design-theoretic construction is efficient and practical. The
schedules are constructed by each processor independently in $O(\sqrt{n})$ time.

Deterministic Construction of Long Schedules. Design theory offers
little insight on how to extend a set of schedules into longer schedules in which
waste is increased in a controlled fashion. We show in Section 5 that longer
schedules with controlled waste can be constructed in time linear in the length of the
schedule. This deterministic construction yields schedules of length $\frac{4}{9}n$ such that
pairwise wasted work increases gradually as processors progress through their
schedules. For each pair of processors $p_1$ and $p_2$, the overlap of the first $t_1$ tasks of
processor $p_1$ and the first $t_2$ tasks of processor $p_2$ is bounded by $O(t_1 t_2 / n + \sqrt{n})$.
The upper bound on pairwise overlaps is illustrated in Figure 1(b). The quadratic
growth in overlap is anticipated by our lower bound. The overall construction
takes linear time and, except for the first $\sqrt{n}$ tasks, the cost of constructing the
schedule is completely amortized.

Randomized Constructions. Finally, in Section 6, we explore the behavior
of schedules selected at random. Specifically, we explore the waste incurred
when each processor’s schedule is selected uniformly among all permutations on
$\{1, \ldots, t\}$. For the case of pairwise waste, we show that with high probability
these random schedules enjoy two satisfying properties: (i) for each pair of
processors $p_1, p_2$, the overlap of the first $t_1$ tasks of processor $p_1$ and the first $t_2$
tasks of processor $p_2$ is no more than
$\frac{t_1 t_2}{t} + O\left(\log n + \sqrt{\frac{t_1 t_2}{t} \log n}\right)$, (ii) all but
a vanishing fraction of the pairs of processors experience no more than a single
redundant task in the first $\sqrt{t}$ tasks of their schedules. This is illustrated in
Figure 1(c). As previously mentioned, the quadratic growth observed in property
(i) above is unavoidable.

The results represented by the surfaces in Figures 1(a), (b) and (c) are com- 
pared along the vertical diagonal cut in Figure 1(d). 
2 Definition and Models

We consider the abstract setting where n processors need to perform t
independent tasks, where $n \le t$. The processors have unique identifiers from
the set $[n] = \{1, \ldots, n\}$ and the tasks have unique identifiers from the set
$[t] = \{1, \ldots, t\}$. Initially each processor knows the tasks that need to be
performed and their identifiers, which is necessary for solving the problem in absence
of communication.

A schedule L is a list $L = (\tau^1, \ldots, \tau^b)$ of distinct tasks from [t], where b is the
length of the schedule ($b \ge 0$). A system of schedules $\mathcal{L}$ is a list of schedules for n
processors $\mathcal{L} = (L_1, \ldots, L_n)$. When each schedule in the system of schedules $\mathcal{L}$
has the same length b, we say that $\mathcal{L}$ has length b. Given a schedule L of length
b, and $c \ge 0$, we define the prefix schedule $L^c$ to be: $L^c = (\tau^1, \ldots, \tau^c)$, if $c \le b$,
and $L^c = L$, if $c > b$. For a system of schedules $\mathcal{L}$ and a vector $a = (a_1, \ldots, a_n)$
($a_i \ge 0$) the system of schedules $\mathcal{L}^a = (L_1^{a_1}, \ldots, L_n^{a_n})$ is called a prefix system of
schedules.

Sometimes the order of tasks in a schedule is irrelevant, and we introduce
the notion of plan as an unordered set of tasks. Given a schedule $L = (\tau^1, \ldots, \tau^a)$
we define the plan $P = P(L)$ to be the set $P = \{\tau^1, \ldots, \tau^a\}$. Given a schedule
L and $c \ge 0$, we write $P^c$ to denote the plan corresponding to the schedule $L^c$
(the set of the first c tasks from schedule L). For a system of schedules
$\mathcal{L} = (L_1, \ldots, L_n)$, a system of plans is the list of plans $\mathcal{P} = (P_1, \ldots, P_n)$, where
$P_i$ is the plan for schedule $L_i$.

We can represent a system of plans as a matrix called a scheme. Specifically,
given a system of plans $\mathcal{P}$ we define the scheme S to be the $n \times t$ matrix
such that $S_{i,j} = 1$ if $j \in P_i$, and $S_{i,j} = 0$ otherwise. Conversely, a scheme S yields
a system of plans $\mathcal{P} = (P_1, \ldots, P_n)$, where $P_i = \{m : S_{i,m} = 1\}$; we say that
$P_1, \ldots, P_n$ are the plans of scheme S. A scheme is called r-regular if each row
has r ones, and k-uniform if each column has k ones. Since scheme and system
of plans representations are equivalent, we choose the most convenient notation
depending on the context. When the ordering of tasks is important, we use the
schedule representation.

To assess the quality of scheme S, we are interested in quantifying the
“wasted” (redundant) work performed by a collection I of processors when each
processor i ($i \in I$) performs all tasks assigned to it by the corresponding plan
$P_i$ of S. We formalize the notion of waste as follows.

Definition 1. For a collection $I \subseteq [n]$ of processors and a scheme S the
I-waste of S, denoted $w_I(S)$, is defined as $w_I(S) = \sum_{i \in I} |P_i| - |\bigcup_{i \in I} P_i|$, where
$P_1, \ldots, P_n$ are the plans of S.

In general, we are interested in bounding the worst case redundant work of
any set of k processors that may (re)establish communication after they perform
all tasks assigned to them. Hence we introduce k-waste by ranging I-waste over
all subsets I of size k:

Definition 2. For a scheme S the k-waste of S is the quantity
$w_k(S) = \max_{I \subseteq [n], |I| = k} w_I(S)$.

For a system of schedules $\mathcal{L}$ we write $w_k(\mathcal{L})$ to stand for $w_k(S)$, where S
is the scheme induced by $\mathcal{L}$. In our work we are mostly interested in bounding
k-waste for the case when k = 2. Observe that $w_{\{i,j\}}(S)$ is exactly $|P_i \cap P_j|$, so
that in this case we are interested in controlling overlaps:



Definition 3. We say that a scheme S is $\lambda$-bounded if $|P_i \cap P_j| \le \lambda$ for all
$i \ne j$. More generally, S is $[\lambda, u]$-bounded if for all sets $U \subseteq [n]$ of cardinality
u we have $|\bigcap_{j \in U} P_j| \le \lambda$. We say that S has $\lambda$-overlap (or is $\lambda$-overlapping) if
there exists $i \ne j$ so that $|P_i \cap P_j| \ge \lambda$. More generally, S has $[\lambda, u]$-overlap if
there is a set $U \subseteq [n]$ of cardinality u such that $|\bigcap_{j \in U} P_j| \ge \lambda$.
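The definitions of waste and boundedness translate directly into set operations. The sketch below assumes a system of plans is given as a Python list of sets, indexed from 0; the function names are illustrative and not part of the paper.

```python
from itertools import combinations


def waste(plans, I):
    """I-waste: total work done by the processors in I minus the number of
    distinct tasks they cover (Definition 1)."""
    union = set().union(*(plans[i] for i in I))
    return sum(len(plans[i]) for i in I) - len(union)


def k_waste(plans, k):
    """k-waste: the largest I-waste over all index sets I of size k
    (Definition 2). Exhaustive, so only suitable for small systems."""
    return max(waste(plans, I) for I in combinations(range(len(plans)), k))


def is_bounded(plans, lam):
    """True if every two distinct plans share at most lam tasks
    (lambda-bounded in the sense of Definition 3)."""
    return all(len(p & q) <= lam for p, q in combinations(plans, 2))
```

For k = 2 the I-waste of a pair reduces to the size of the intersection of the two plans, which is why 2-waste and pairwise overlap are used interchangeably in what follows.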



In this work we assume that it takes unit time to add, multiply or divide two 
log (max{n, t})-bit numbers. 



3 Lower Bounds on Processor-Pairs Overlaps 

In this section we show lower bounds for 2-waste. We prove that 2-waste has
to grow quadratically with the length of system of schedules, and is inversely
proportional to t. This is intuitive; if $t \gg n$ then it is easy to construct n
schedules of at least $\lfloor t/n \rfloor$ tasks such that the resulting scheme is 0-bounded,
i.e., the 2-waste of the scheme is 0. On the other hand if n = t then any system
of schedules of length at least 2 must be 1-overlapping. A system of 1-bounded
schedules of length $\Theta(\sqrt[3]{n})$ for t = n tasks was designed by Dolev et al. [6]. We
show that for n = t no schedules can have length greater than $\sqrt{n}$ and still
be 1-bounded.

We first show a key lemma that uses a probabilistic argument (see [1] for
other proofs with this flavor). Recall that given a schedule $L_i$ the plan $P_i^a$ is the
set of the first a tasks in $L_i$.

Lemma 1. Let $\mathcal{L} = (L_1, \ldots, L_n)$ be a system of schedules of length t, let $0 \le a \le t$,
$0 \le b \le t$, and $\lambda = \max_{i \ne j} |P_i^a \cap P_j^b|$. Then $(n - 1)\lambda \ge \frac{n}{t} ab - \min\{a, b\}$.

Proof. We select i and j independently at random among [n] and bound the
expected value of $|P_i^a \cap P_j^b|$ in two ways. First observe that we have the total of
$n^2$ pairs for i and j. If $i \ne j$ then the cardinality of the intersection is bounded
by $\lambda$. If i = j then the cardinality is obviously $\min\{a, b\}$. Hence

$E[|P_i^a \cap P_j^b|] \le \frac{n(n-1)\lambda + n \min\{a, b\}}{n^2}$.

For the second bound we consider t random variables $X_r$, indexed by $r \in [t]$,
defined as follows: $X_r = 1$ if $r \in P_i^a \cap P_j^b$, and 0 otherwise. Observe that
$E[|P_i^a \cap P_j^b|] = E[\sum_{r \in [t]} X_r]$. By linearity of expectation, and the fact that the events
are independent, we may recompute this expectation

$E[|P_i^a \cap P_j^b|] = \sum_{r \in [t]} E[X_r] = \sum_{r \in [t]} \Pr[r \in P_i^a] \cdot \Pr[r \in P_j^b]$.

Now we introduce the function $x^m(r)$, equal to the number of the prefixes
of schedules of length m to which r belongs, i.e., $x^m(r) = |\{i : r \in P_i^m\}|$. Using
the fact that $\Pr[r \in P_i^m] = x^m(r)/n$, and twice the Cauchy-Schwarz inequality,
we can rewrite the expectation as follows.

$E[|P_i^a \cap P_j^b|] = \frac{1}{n^2} \sum_{r \in [t]} x^a(r) x^b(r) \ge \frac{1}{t n^2} \left(\sum_{r \in [t]} x^a(r)\right) \left(\sum_{r \in [t]} x^b(r)\right)$.

Finally, since $|P_i^m| = m$, we have that $\sum_{r \in [t]} x^m(r) = m \cdot n$. Hence
$E[|P_i^a \cap P_j^b|] \ge \frac{ab}{t}$, and the result follows.



For any given system of schedules $\mathcal{L}$, Lemma 1 leads to a lower bound on
the pairwise overlap for any two processors i and j when i performs the tasks in
$P_i^a$ and j performs the tasks in $P_j^b$. The lower bound in the next theorem states
that the pairwise overlap must be proportional to $a \cdot b$ (see Figure 1(a) for the
case when n = t).



Theorem 1. Let $\mathcal{L} = (L_1, \ldots, L_n)$ be a system of schedules of length t and let
$0 \le a \le t$, $0 \le b \le t$. Then
$\max_{i \ne j} |P_i^a \cap P_j^b| \ge \left\lceil \frac{n}{(n-1)t} ab - \frac{\min\{a, b\}}{n-1} \right\rceil$.

An immediate consequence of Theorem 1 is that 2-waste must grow quadratically
with the length of the schedule. Observe that k-waste, for $k \ge 2$, must be at least
as big as 2-waste, because additional processors can only increase the number
of tasks executed redundantly. Hence our next result is that k-waste must grow
quadratically with the length of the schedule.



Corollary 1. If $\mathcal{L}$ is an n-processor system of schedules of length r for t = n
tasks, where $t \ge r$, then $w_k(\mathcal{L}) \ge \left\lceil \frac{r(r-1)}{n-1} \right\rceil$.

Finally we show that no 1-bounded schedules exist of length greater than
$\sqrt{n - 3/4} + \frac{1}{2} > \sqrt{n}$.

Corollary 2. If $r > \sqrt{n - 3/4} + \frac{1}{2}$ then any n-processor schedule of length r
for n tasks is 2-overlapping.

This result is tight: in Section 4 we construct an infinite family of 1-bounded
schedules of length $\sqrt{n - 3/4} + \frac{1}{2}$.
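The lower bound can be evaluated exactly with integer arithmetic. The sketch below assumes the bound follows from Lemma 1 in the form $(n-1)\lambda \ge \frac{n}{t}ab - \min\{a,b\}$, rounded up because overlaps are integers; the function name is illustrative.

```python
def overlap_lower_bound(n, t, a, b):
    """Smallest integer lam with (n-1)*lam >= (n/t)*a*b - min(a, b),
    computed without floating point. Requires n >= 2 and t >= 1."""
    numerator = n * a * b - t * min(a, b)
    denominator = (n - 1) * t
    # -(-x // y) is ceiling division for a positive denominator.
    return max(0, -(-numerator // denominator))
```

With n = t = 13 the bound is 1 for prefixes of length 4 and jumps to 2 at length 5, matching the threshold $\sqrt{n - 3/4} + \frac{1}{2} = 4$ of Corollary 2.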



4 Construction of Deterministic “Square-root” Plans

We now present an efficient construction of deterministic 1-bounded schedules
with maximal $\Theta(\sqrt{n})$ length, for n = t. In the rest of this section we assume
that n = t.

We briefly introduce the concept of design, the major object of interest in
design theory. A reader interested in this subject is referred to, e.g., [10]. A
design is a set of n points and t blocks (subsets of points) with the following
properties. Each block contains exactly k points, each point is contained in (is
on) exactly r blocks, and the number of blocks any subset of $\sigma$ points is on
is exactly $\lambda$. An object with such properties is called a $\sigma$-$(n, k, \lambda)$ design. A design
can be represented by an incidence matrix $(a_{i,j})$ of zeros and ones. Numbering
points and blocks, an element $a_{i,j}$ of the matrix is 1 if point i is on block j
and otherwise 0. Designs have many interesting properties. One fact is that a
$\sigma$-$(n, k, \lambda)$ design is also a $u$-$(n, k, \lambda_u)$ design for $0 < u \le \sigma$. Not surprisingly, for
smaller u the number of blocks a subset of u points is on increases. This number
is given by¹: $\lambda_u = \lambda \cdot \frac{(n-u)^{\underline{\sigma-u}}}{(k-u)^{\underline{\sigma-u}}}$ (see [10] Theorem 1.2).

G.G. Malewicz, A. Russell, and A. A. Shvartsman

We now give the result linking design theory to our setting. 

Theorem 2. The incidence matrix of any $\sigma$-$(n, k, \lambda)$ design with t blocks yields
a $[\lambda_u, u]$-bounded scheme ($0 < u \le \sigma$) for n processors and t tasks, where each
processor executes $r = \frac{t}{n} k$ tasks, each task is executed k times, and
$\lambda_u = \lambda \cdot \frac{(n-u)^{\underline{\sigma-u}}}{(k-u)^{\underline{\sigma-u}}}$.

Proof. Take any $\sigma$ distinct points of the design. By the definition of $\sigma$-$(n, k, \lambda)$
design the number of blocks on these $\sigma$ points is equal to $\lambda$. Hence the number
of tasks executed in common by any $\sigma$ processors is exactly $\lambda$. The formula for
$\lambda_u$ results from Theorem 1.2 [10]. This is because the design is a $(\sigma - (\sigma - u))$-$(n, k, \lambda_u)$
design, i.e., a $u$-$(n, k, \lambda_u)$ design, for $\lambda_u$ as in that theorem. Moreover, since
$t \cdot k = n \cdot r$ (see Corollary 1.4 [10]), each processor executes $r = \frac{t}{n} k$ tasks.

Theorem 2 makes it clear that we need to look for designs with large k and
small $\lambda$ because such designs yield long plans (large r) with small overlap (small
$\lambda$). We will consider a special case of this theorem for $\sigma = 2$. In this case we
want to guarantee that 2-waste is exactly $\lambda$ (note that when $u = \sigma = 2$, we have
$\lambda_u = \lambda$).

We use a well-known construction of a 2-$(q^2+q+1, q+1, 1)$ design, for a prime
q. The algorithm is presented in Figure 2. It has the following properties: (1) For
a given number $i \in \{0, \ldots, q^2 + q\}$, the value of a function blocksOnPoint(i) is a
set of q + 1 distinct integers from $\{0, \ldots, q^2 + q\}$. (2) For $i \ne j$ the intersection
of the set blocksOnPoint(i) with the set blocksOnPoint(j) is a singleton from
$\{0, \ldots, q^2 + q\}$. For a proof of these two standard facts from design theory see
e.g. [10, 15]. Invoking the function blocksOnPoint(i) for any i requires finding at
most two multiplicative inverses $b^{-1}$ and $c^{-1}$ in $Z_q$. We can do this in $O(\log q)$
by using the Extended Euclid’s Algorithm (see [14], page 325). The worst case
time of finding inverses is bounded, by the Lamé theorem, by $O(\log q)$, see [14],
page 343. This cost is subsumed by q iterations of the loop. Hence the total time
cost of the function is $O(q)$.

Theorem 3. If $r \cdot (r - 1) = n - 1$ and $r - 1 = q$ is prime then it is possible to
construct an r-regular r-uniform 1-bounded scheme for n processors and n tasks.
Each plan is constructed independently in $O(\sqrt{n})$ time.

Using our construction we can quickly compute schedules of size approximately
$\sqrt{n}$ for n processors and t = n tasks, provided we have a prime q such
that $q(q+1) = n - 1$. Of course in general, for a given n there may not be a prime
q that satisfies $q(q+1) = n - 1$. This however does not limit our construction.
We discuss this in more detail in Section 5.

¹ The expression $y^{\underline{\sigma}}$ is the “falling power” defined as $y(y-1)(y-2) \ldots (y-\sigma+1)$,
with $y^{\underline{0}} = y^0 = 1$.
vectorToIndex( (a,b,c) )
    if a = 1 then return b * q + c
    else if b = 1 then return q * q + c
    else return q * q + q

blocksOnPoint( i )
    (a,b,c) = indexToVector( i )
    block = ∅
    if a = 1 ∧ b ≠ 0 ∧ c ≠ 0 then block U= {vectorToIndex( (0, 1, −b * c^{-1}) )}
        for d = 0 to q − 1 do block U= {vectorToIndex( (1, (−1 − c * d) * b^{-1}, d) )}
    if a = 1 ∧ b = 0 ∧ c ≠ 0 then block U= {vectorToIndex( (0, 1, 0) )}
        for d = 0 to q − 1 do block U= {vectorToIndex( (1, d, −c^{-1}) )}
    if a = 1 ∧ b ≠ 0 ∧ c = 0 then block U= {vectorToIndex( (0, 0, 1) )}
        for d = 0 to q − 1 do block U= {vectorToIndex( (1, −b^{-1}, d) )}
    if a = 1 ∧ b = 0 ∧ c = 0 then block U= {vectorToIndex( (0, 0, 1) )}
        for d = 0 to q − 1 do block U= {vectorToIndex( (0, 1, d) )}
    if a = 0 ∧ b = 1 ∧ c ≠ 0 then block U= {vectorToIndex( (0, 1, −c^{-1}) )}
        for d = 0 to q − 1 do block U= {vectorToIndex( (1, d, −d * c^{-1}) )}
    if a = 0 ∧ b = 1 ∧ c = 0 then block U= {vectorToIndex( (0, 0, 1) )}
        for d = 0 to q − 1 do block U= {vectorToIndex( (1, 0, d) )}
    if a = 0 ∧ b = 0 ∧ c = 1 then block U= {vectorToIndex( (0, 1, 0) )}
        for d = 0 to q − 1 do block U= {vectorToIndex( (1, d, 0) )}
    return block

indexToVector( i )
    if i = q * q + q then return (0, 0, 1)
    else if i ≥ q * q then return (0, 1, i − q * q)
    else return (1, i div q, i mod q)

Fig. 2. Algorithm for finding q + 1 blocks on a point of a 2-$(q^2 + q + 1, q + 1, 1)$ design.
The notation x U= y stands for x = x ∪ y. Arithmetic on vector components is in $Z_q$.
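The algorithm of Figure 2 can be sketched in Python as follows. This is an illustrative reading of the figure rather than the authors' code: points and blocks are taken to be the normalized vectors (1, b, c), (0, 1, c) and (0, 0, 1) over $Z_q$, a block is on a point when their dot product is zero modulo q, and the modular inverse uses the built-in `pow(x, -1, q)` (available from Python 3.8). The seven cases of the figure are grouped into six branches.

```python
def vector_to_index(v, q):
    a, b, c = v
    if a == 1:
        return b * q + c
    if b == 1:
        return q * q + c
    return q * q + q


def index_to_vector(i, q):
    if i == q * q + q:
        return (0, 0, 1)
    if i >= q * q:
        return (0, 1, i - q * q)
    return (1, i // q, i % q)


def blocks_on_point(i, q):
    """Indices of the q + 1 blocks on point i, for a prime q."""
    a, b, c = index_to_vector(i, q)
    inv = lambda x: pow(x, -1, q)          # multiplicative inverse in Z_q
    if a == 1 and b != 0:                  # points (1, b, c) with b != 0
        first = (0, 1, -b * inv(c) % q) if c != 0 else (0, 0, 1)
        rest = [(1, (-1 - c * d) * inv(b) % q, d) for d in range(q)]
    elif a == 1 and c != 0:                # points (1, 0, c) with c != 0
        first = (0, 1, 0)
        rest = [(1, d, -inv(c) % q) for d in range(q)]
    elif a == 1:                           # point (1, 0, 0)
        first = (0, 0, 1)
        rest = [(0, 1, d) for d in range(q)]
    elif b == 1 and c != 0:                # points (0, 1, c) with c != 0
        first = (0, 1, -inv(c) % q)
        rest = [(1, d, -d * inv(c) % q) for d in range(q)]
    elif b == 1:                           # point (0, 1, 0)
        first = (0, 0, 1)
        rest = [(1, 0, d) for d in range(q)]
    else:                                  # point (0, 0, 1)
        first = (0, 1, 0)
        rest = [(1, d, 0) for d in range(q)]
    return {vector_to_index(v, q) for v in [first] + rest}
```

Calling `blocks_on_point(u, q)` for every u in range(q*q + q + 1) gives one plan per processor; each call performs at most two inversions and a loop of q iterations.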



5 Constructing Long Deterministic Schedules 

Applying design theory principles to constructing longer schedules is not
necessarily a good idea. If we took a design with blocks of size $k > \sqrt{n}$ we could build
a corresponding system of schedules using Theorem 2. Observe that Theorem 1
guarantees that such a system would have overlap $\Omega(k^2/n)$. Unfortunately there

would be no guarantee that the overlap would increase gradually as processors 
progress through their schedules. In particular, overlap may be incurred 

even if two processors “meet” only after executing O(^) tasks. 

In this section we present a construction for longer schedules with the goal of
maintaining a graceful degradation of overlap. Our novel construction extends
the $\sqrt{n}$-length system of plans obtained in Theorem 3 so that the increase of
overlap is controlled as the number of tasks executed by each processor grows.
In the following sections we construct raw schedules, and then show how to use
them to produce schedules with graceful degradation of overlap for an arbitrary
value of n.

Raw Schedules. In this section we build long raw schedules that have repeated
tasks. We assume that $n = r^2 - r + 1$ and $r = q + 1$ for a prime q and use the
construction from Theorem 3. Let $\mathcal{P} = (P_1, \ldots, P_n)$ be the resulting 1-bounded
system of n plans of length r, where $P_u$ is the plan for each $u \in \{1, \ldots, n\}$. For
a processor u ($1 \le u \le n$) let $L_u = (t_u^1, \ldots, t_u^r)$ be the sequence of tasks, in some
order, from the plan $P_u$ constructed as in Theorem 3. We introduce the term
raw schedule to denote a sequence of task identifiers where some tasks may be
repeated.

We now present and analyze a system $\mathcal{R}(\mathcal{P})$ of raw schedules. For each
processor u, we construct the raw schedule $R_u$ of length $r^2 > n$ by concatenating
($\circ$) distinct $L_i$, where $i \in P_u$. Specifically, we let $R_u = L_{t_u^1} \circ L_{t_u^2} \circ \ldots \circ L_{t_u^r}$. Thus
the raw schedule for processor u consists of the r tasks of $L_{t_u^1}$, followed by the r
tasks of $L_{t_u^2}$, and so on up to the r tasks of $L_{t_u^r}$. Given
$R_u = (r_u^1, \ldots, r_u^{r^2})$ we define $R_u^a = (r_u^1, \ldots, r_u^a)$ to be the prefix of $R_u$ of
length a, and $T_u^a = \{r_u^1, \ldots, r_u^a\}$ for $0 \le a \le r^2$.

A direct consequence of Theorem 3 is that raw schedules can be constructed 
efficiently. 

Theorem 4. Each raw schedule in $\mathcal{R}(\mathcal{P})$ can be constructed in $O(n)$ time.

Note that it is not necessary to precompute the entire raw schedule; instead
it can be computed in r-size segments as needed. Some of the tasks in a raw
schedule may be repeated and consequently the number of distinct tasks in a
raw schedule of length $r^2$ may be smaller than $r^2$; naturally processors do not
execute repeated instances of tasks. For the proof of graceful increase of pairwise
redundancy it is important to show that the number of distinct tasks in our raw
schedules increases gracefully.

Theorem 5. For any $R_u = L_{t_u^1} \circ L_{t_u^2} \circ \ldots \circ L_{t_u^r} = (r_u^1, \ldots, r_u^{r^2})$ and $1 \le a \le r^2$,
with $i = \lceil a/r \rceil$:
$|T_u^a| = |\{r_u^1, \ldots, r_u^a\}| \ge (i-1)\left(r - \frac{1}{2}(i-2)\right) + \max\{0, a - (i-1)(r+1)\}$.

Proof. Consider the task $r_u^a$. It appears in $L_{t_u^i}$ where $i = \lceil a/r \rceil$. For tasks that
appear in plans $P_{t_u^1}, \ldots, P_{t_u^{i-1}}$ the number of repeated tasks is at most $1 + \ldots +
(i - 2) = (i - 1)(i - 2)/2$ because $\mathcal{P}$ is a 1-bounded system of plans (any two of
these plans intersect by exactly one, see Theorem 3). Hence there are at least
$(i-1)r - (i-1)(i-2)/2$ distinct tasks in the raw schedule $L_{t_u^1} \circ \ldots \circ L_{t_u^{i-1}}$.
We now assess any additional distinct tasks appearing in $L_{t_u^i}$. Task $r_u^a$ is the
task number $a - (i-1)r$ in $L_{t_u^i}$. Since $\mathcal{P}$ is 1-bounded, up to $i - 1$ tasks in $L_{t_u^i}$
may already be contained in $P_{t_u^1}, \ldots, P_{t_u^{i-1}}$. Of course in no case may the number
of redundant tasks exceed $a - (i-1)r$. Hence the number of additional distinct
tasks from $P_{t_u^i}$ is at least $\max\{0, a - (i-1)r - (i-1)\} = \max\{0, a - (i-1)(r+1)\}$.

Corollary 3. Any $R_u$ contains at least $\frac{1}{2}(r^2 + r) = \frac{1}{2}n + r - \frac{1}{2}$ distinct tasks.

Together with Theorem 4, this result also shows that the schedule computation
is fully amortized, since it takes $O(n)$ time to compute a schedule that
includes more than n/2 distinct tasks.
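As a small illustration of raw schedules, the sketch below builds them for n = 7 (r = 3, q = 2) and compares the number of distinct tasks in every prefix with a bound of the form $(i-1)r - (i-1)(i-2)/2 + \max\{0, a - (i-1)(r+1)\}$, $i = \lceil a/r \rceil$. Several things here are assumptions for the example only: tasks and processors are numbered from 0, the 1-bounded plans are hard-coded as translates of the difference set {0, 1, 3} modulo 7 (the Fano plane) instead of being produced by the algorithm of Figure 2, and each plan is listed in increasing order.

```python
R = 3                                   # plan length r; n = r*r - r + 1 = 7
N = R * R - R + 1

# Plan of processor u: translate of the perfect difference set {0, 1, 3} mod 7.
PLANS = [sorted({u % N, (u + 1) % N, (u + 3) % N}) for u in range(N)]


def raw_schedule(u):
    """Concatenate the plans L_i for every task identifier i in the plan of u."""
    return [task for i in PLANS[u] for task in PLANS[i]]


def distinct_lower_bound(a, r):
    """Lower bound on the number of distinct tasks among the first a entries."""
    i = -(-a // r)                       # ceil(a / r)
    return (i - 1) * r - (i - 1) * (i - 2) // 2 + max(0, a - (i - 1) * (r + 1))
```

For processor 0 the raw schedule is (0, 1, 3, 1, 2, 4, 3, 4, 6), which contains 6 = (r^2 + r)/2 distinct tasks, so the bound is met with equality in this small case.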

For any processors u and w we wish to determine $\{u, w\}$-waste as u and w
progress through the raw schedules $R_u$ and $R_w$. We now show that for $1 \le a, b \le r^2$
the size of $T_u^a \cap T_w^b$ grows gracefully as a and b increase.

Theorem 6. For any $R_u$, $R_w$ and $0 \le a, b \le r^2$:
$|T_u^a \cap T_w^b| \le \min\{a, b, r - 1 + \lceil a/r \rceil \cdot \lceil b/r \rceil\}$.
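Proof. By the definition of $\mathcal{P}$ and the raw schedules $R_u$ and $R_w$, $r_u^a \in P_{t_u^i}$,
where $i = \lceil a/r \rceil$, and $r_w^b \in P_{t_w^j}$, where $j = \lceil b/r \rceil$. Therefore,
$T_u^a \subseteq P_{t_u^1} \cup \ldots \cup P_{t_u^i}$ and $T_w^b \subseteq P_{t_w^1} \cup \ldots \cup P_{t_w^j}$. Consequently,

$T_u^a \cap T_w^b \subseteq (P_{t_u^1} \cup \ldots \cup P_{t_u^i}) \cap (P_{t_w^1} \cup \ldots \cup P_{t_w^j}) = \bigcup_{1 \le x \le i, 1 \le y \le j} (P_{t_u^x} \cap P_{t_w^y})$.

Since the system of plans $\mathcal{P}$ is 1-bounded, $R_u$ and $R_w$ contain at most one
common $L_z$ for some task identifier z. In the worst case, for the corresponding $P_z$, this
contributes $|P_z \cap P_z| = |P_z| = r$ tasks to the intersection of $T_u^a$ and $T_w^b$. On the
other hand, $|P_{t_u^x} \cap P_{t_w^y}| \le 1$ when both $t_u^x$ and $t_w^y$ are not z, again because $\mathcal{P}$ is
1-bounded. Thus, $|T_u^a \cap T_w^b| \le r + |\bigcup_{t_u^x, t_w^y \ne z} (P_{t_u^x} \cap P_{t_w^y})| \le r + i \cdot j - 1$.

Finally, the overlap cannot be greater than $\min\{a, b\}$.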

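In the following theorem we show how the useful work (not redundant) grows
as processors progress through their schedules.

Theorem 7. For any processors u and w:

(a) If $i + j \le r$ then $|T_u^{ir} \cup T_w^{jr}| \ge r(i + j) - r + 1 - \frac{1}{2}\left((i + j)^2 - (i + j)\right)$,

(b) If $i + j > r$ then $|T_u^{ir} \cup T_w^{jr}| \ge \frac{r^2}{2} - \frac{r}{2} + \frac{9}{8}$.

Proof. By Theorem 5, $|T_u^{ir}| \ge i \cdot r - i(i-1)/2$ and $|T_w^{jr}| \ge j \cdot r - j(j-1)/2$,
and by Theorem 6, $|T_u^{ir} \cap T_w^{jr}| \le r - 1 + i \cdot j$. Thus:

$|T_u^{ir} \cup T_w^{jr}| = |T_u^{ir}| + |T_w^{jr}| - |T_u^{ir} \cap T_w^{jr}| \ge (i + j)\left(r - \frac{i + j - 1}{2}\right) - r + 1$.

Consider the function $f(i + j) = f(x) = x\left(r - \frac{x-1}{2}\right) - r + 1 = -\frac{1}{2}x^2 + \left(r + \frac{1}{2}\right)x + (1 - r)$.
It is nonnegative for $2 \le x \le 2r$. Additionally $f(x)$ grows from r,
for $x = 2$, to a global maximum of $\frac{r^2}{2} - \frac{r}{2} + \frac{9}{8}$, for $x = r + \frac{1}{2}$, and then decreases
to 1, for $x = 2r$. Because $|T_u^{ir}|$ and $|T_w^{jr}|$ are monotone nondecreasing in i
and j respectively (the number of tasks already performed by processors cannot
decrease), we have that $|T_u^{ir} \cup T_w^{jr}| \ge \frac{r^2}{2} - \frac{r}{2} + \frac{9}{8}$ for $i + j > r$.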

Deterministic Construction for Arbitrary n. We now discuss practical
aspects of using the system of raw schedules $\mathcal{R}(\mathcal{P})$. Recall that a raw schedule for
a processor contains repeated tasks. When a schedule is compacted by removing
all repeated tasks, the result may contain about half of all tasks (Corollary 3).
To construct a full schedule that has all t = n distinct tasks, we append the
remaining tasks at the end of a compacted schedule (in arbitrary order). For
the system $\mathcal{R}(\mathcal{P})$ we call such a system of schedules $\mathcal{F}(\mathcal{P}) = (F_1, \ldots, F_n)$. For a
schedule $F_i$ we write $N_i$ to denote the corresponding plan. In this section we use
our results obtained for raw schedules to establish a bound on pairwise overlap
for $\mathcal{F}(\mathcal{P})$. Recall that by construction, the length of $\mathcal{F}(\mathcal{P})$ is $q^2 + q + 1$, where q
is a prime. We show that common padding techniques can be used to construct
schedules for arbitrary n = t such that the pairwise overlap is similarly bounded.

First we analyze overlaps for a system of schedules $\mathcal{F}(\mathcal{P})$. Assume that a
processor u advanced to task number $i \cdot r$ in its raw schedule $R_u$ ($1 \le i \le r$). Then,
by Theorem 5, it has executed at least $i\left(r - \frac{i-1}{2}\right)$ distinct tasks. Conversely, for a
given x we can define $g(x, r)$ to be the number of segments of the raw schedules
$R_u$ that are sufficient to include x distinct tasks, i.e., $|T_u^{g(x,r) \cdot r}| \ge x$. Solving the
quadratic equation $g(x,r)\left(r - \frac{g(x,r) - 1}{2}\right) = x$ yields
$g(x, r) = \left\lceil r + \frac{1}{2} - \sqrt{\left(r + \frac{1}{2}\right)^2 - 2x} \right\rceil$,
for $x = 0, \ldots, \frac{1}{2}(r^2 + r)$ (observe that $g(0, r) = 0$, $g(1, r) = 1$, $g(r, r) = 1$,
$g(r + 1, r) = 2$, $g(\frac{1}{2}(r^2 + r), r) = r$). In the next theorem we use the definition of g
and the result from Theorem 6 to construct a system of schedules with bounded
overlaps (see Figure 1(b) for the plot of the upper bound).

Theorem 8. For $n = r^2 - r + 1$, $q = r - 1$ prime, the system of schedules
$\mathcal{F}(\mathcal{P})$ can be constructed deterministically in time $O(n)$ independently for each
processor. Pairwise overlaps are bounded by:

$|N_u^a \cap N_w^b| \le \min\{a, b, r - 1 + g(a, r) \cdot g(b, r)\}$, if $a, b \le \frac{1}{2}(r^2 + r)$,

$|N_u^a \cap N_w^b| \le \min\{a, b\}$, otherwise.
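The function g and the resulting overlap bound are easy to tabulate. The sketch below avoids floating-point square roots by using the integer form of the defining inequality, $g(2r + 1 - g) \ge 2x$, which is equivalent to $g(r - \frac{g-1}{2}) \ge x$; the bound itself is taken to be $\min\{a, b, r - 1 + g(a,r) g(b,r)\}$ on the range $a, b \le \frac{1}{2}(r^2+r)$. Function names are illustrative.

```python
def segments_needed(x, r):
    """g(x, r): the smallest number g of r-size segments that guarantees
    x distinct tasks, i.e. the smallest g with g*(2r + 1 - g) >= 2x."""
    if not 0 <= x <= (r * r + r) // 2:
        raise ValueError("g(x, r) is only defined for 0 <= x <= (r^2 + r)/2")
    g = 0
    while g * (2 * r + 1 - g) < 2 * x:
        g += 1
    return g


def overlap_upper_bound(a, b, r):
    """Upper bound on the pairwise overlap of prefixes of lengths a and b."""
    half = (r * r + r) // 2
    if a <= half and b <= half:
        return min(a, b, r - 1 + segments_needed(a, r) * segments_needed(b, r))
    return min(a, b)
```

For example, with r = 12 (q = 11, n = 133) two processors that have each completed 24 tasks share at most 20 of them under this bound, slightly below the trivial bound of 24.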

We next show that for long lengths pairwise overlap is strictly less than
$\min\{a, b\}$ (the trivial part of the upper bound shown in Theorem 8). Assume
that processors u and w have advanced to task number $i \cdot r$ in $R_u$ and $R_w$
respectively ($1 \le i \le r$). By Theorem 5 the number of distinct tasks executed by
each processor is at least $i\left(r - \frac{i-1}{2}\right)$. By Theorem 6 the overlap is at most $r - 1 + i^2$.
Equating the two expressions yields an equation, solutions to which tell us for
which i the overlap does not exceed the number of distinct tasks in the schedule.
The first (trivial) solution i = 1 simply describes the possibility of two processors
executing the same r tasks when the first task identifier in $P_u$ is the same as
that of $P_w$. The second solution $i = \frac{2}{3}(r - 1)$, with Theorem 5, gives the number
of distinct tasks in each schedule, which is no less than $\frac{4}{9}r^2 + \frac{1}{9}(r - 5)$. This
guarantees that, using $\mathcal{R}(\mathcal{P})$, there are no two processors that execute the same
subsets of tasks when each executes up to $\frac{4}{9}r^2 + \frac{1}{9}(r - 5)$ tasks. Hence as long as
processors have not executed more than $\frac{4}{9}n - O(\sqrt{n})$ tasks, the nontrivial part
of the upper bound in Theorem 8 applies. The remaining tasks (approximately
$\frac{5}{9}$ of the tasks) can be chosen by the processors arbitrarily (for example using a
permutation) since our approach does not provide non-trivial overlap guarantees
in that region. Note however, that for schedules longer than $\frac{4}{9}n$ the lower bound
on 2-waste, by Theorem 1, is approximately $\frac{16}{81}n$, which is already linear in n.
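The equation mentioned above can be solved explicitly. The following worked derivation assumes the two expressions are $i\left(r - \frac{i-1}{2}\right)$ for the number of distinct tasks (Theorem 5) and $r - 1 + i^2$ for the overlap (Theorem 6), both taken at prefix length $i \cdot r$.

```latex
\begin{align*}
i\left(r - \frac{i-1}{2}\right) &= r - 1 + i^2 \\
3i^2 - (2r+1)\,i + 2(r-1) &= 0 \\
i &= \frac{(2r+1) \pm \sqrt{(2r+1)^2 - 24(r-1)}}{6}
   = \frac{(2r+1) \pm (2r-5)}{6},
\end{align*}
% since (2r+1)^2 - 24(r-1) = 4r^2 - 20r + 25 = (2r-5)^2.
% The two roots are i = 1 and i = (2/3)(r-1).
% Substituting the second root into the Theorem 5 expression:
\begin{align*}
\frac{2}{3}(r-1)\left(r - \frac{\tfrac{2}{3}(r-1) - 1}{2}\right)
  = \frac{2}{3}(r-1)\cdot\frac{4r+5}{6}
  = \frac{(r-1)(4r+5)}{9}
  = \frac{4}{9}r^2 + \frac{1}{9}(r-5).
\end{align*}
```

Under these assumptions the roots are $i = 1$ and $i = \frac{2}{3}(r-1)$, and the second root corresponds to $\frac{4}{9}r^2 + \frac{1}{9}(r-5)$ distinct tasks.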

We now discuss the case when the number of processors n is not of the form
$q^2 + q + 1$, for some prime q. Since primes are dense, for any fixed $\varepsilon > 0$ and
sufficiently large n, we can choose² a prime p in $O(n)$ time such that $n - 1 \le
p(p + 1) \le (1 + \varepsilon)n - 1$. Using standard padding techniques we can construct a
system of schedules of length n with overlap bounded similarly to Theorem 8.
An easy analysis yields that the upper bound is strictly lower than the trivial
bound as long as processors advance at most $\frac{4}{9}n - O(\sqrt{n}) - O(n\sqrt{\varepsilon})$ through
their schedules.

In our presentation we assume that a suitable prime is available. The prime
can be computed as follows: Find an integer $p \in [\sqrt{n}, \sqrt{n}(1 + \varepsilon)]$ that
satisfies: 1) $n - 1 \le p(p + 1) \le (1 + \varepsilon)n - 1$, and 2) p is not divisible by any of
$2, 3, 4, 5, \ldots, \lceil n^{1/4}(1 + \varepsilon) \rceil$. This gives an $O(\varepsilon n^{3/4})$ time algorithm. Alternatively, if
we assume the Extended Riemann Hypothesis, we can use an algorithm from
[17] to find the prime in $O(\varepsilon \sqrt{n} \log^4 n \log\log\log n)$. In any case the cost is
expended once at the beginning of the construction, and this prime can be used
multiple times so that this cost can be amortized over long-lived computations.
Moreover, this cost does not distort the linear complexity of schedule
construction. Finally observe that the schedules are produced in segments of size $\Theta(\sqrt{n})$.
Thus if processors become able to communicate prior to the completion of all
tasks then at most $\sqrt{n}$ tasks would have been scheduled unnecessarily.

² This results from the Prime Number Theorem. Due to lack of space we show this in
the technical report [15].
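The trial-division search for a suitable prime can be sketched as follows. This is an illustrative implementation, assuming the condition on p is $n - 1 \le p(p+1) \le (1+\varepsilon)n - 1$; the function names are hypothetical, and no attempt is made to match the stated running time exactly.

```python
from math import isqrt


def is_prime(p):
    """Trial division by every integer from 2 up to floor(sqrt(p))."""
    if p < 2:
        return False
    return all(p % d != 0 for d in range(2, isqrt(p) + 1))


def find_prime(n, eps):
    """Smallest prime p with n - 1 <= p*(p + 1) <= (1 + eps)*n - 1,
    or None if no such prime exists for this n and eps."""
    low, high = n - 1, (1 + eps) * n - 1
    p = max(2, isqrt(n) - 1)            # p*(p + 1) cannot exceed n - 1 here (n >= 7)
    while p * (p + 1) < low:
        p += 1
    while p * (p + 1) <= high:
        if is_prime(p):
            return p
        p += 1
    return None
```

When n is already of the form q^2 + q + 1 for a prime q, calling `find_prime(n, 0.0)` returns that q and no padding is needed.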

6 Randomized Schedules 

In this section we examine randomized schedules that, with high probability, 
allow us to control waste for the complete range of schedule lengths. 

When the processors are endowed with a reasonable source of randomness, 
a natural candidate scheduling algorithm is Random, where processors select 
tasks by choosing them uniformly among all tasks they have not yet completed. 
This amounts to the selection, by each processor i, of a random permutation
$\pi_i \in S_{[t]}$ after which the processor proceeds with the tasks in the order given by
$\pi_i$: $\pi_i(1), \pi_i(2), \ldots$. ($S_{[t]}$ denotes the collection of all permutations of the set [t].)

These permutations $\{\pi_i \mid i \in [n]\}$ induce a system of schemes: specifically,
coupled with a length $\ell_i \le t$ for each processor i, such a family of permutations
induces the plans $P_i = \pi_i([\ell_i])$ which together comprise the scheme $S[\ell]$. Our
goal is to show that these schemes are well behaved for each $\ell$, guaranteeing that
waste will be controlled. For 2-waste this amounts to bounding, for each pair i, j
and each pair of lengths $\ell_i, \ell_j$, the overlap $|\pi_i([\ell_i]) \cap \pi_j([\ell_j])|$. Observe that when
these $\pi_i$ are selected at random, the expected size of this intersection is $\ell_i \ell_j / t$,
and our goal will be to show that with high probability, each such intersection
size is near this expected value. This is the subject of Theorem 9 below:

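Theorem 9. Let $\pi_i$ be a family of n permutations of [t], chosen independently
and uniformly at random. Then there is a constant c so that with probability at
least $1 - 1/n$, the following is satisfied:

1. $\forall i, j \le n$ and $\forall \ell_i, \ell_j \le t$, $|\pi_i([\ell_i]) \cap \pi_j([\ell_j])| \le \frac{\ell_i \ell_j}{t} + \Delta$, for
$\Delta = \Delta(\ell_i, \ell_j) = c \max\left\{\log n, \sqrt{\frac{\ell_i \ell_j}{t} \log n}\right\}$,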

2. < clogn, the number of pairs i.J for whieh |7Ti([v^]) fi 0 is 

at most * 

In particular, for each $\ell$, $w_2(S[\ell]) \le \max_{i,j} \frac{\ell_i \ell_j}{t} + \Delta(\ell_i, \ell_j)$.

Observe that Theorem 1 shows that schemes with plans of size $\ell$ must have
$\Omega(\ell^2/t)$ overlap; hence these randomized schemes, for long regular schedules (i.e.,
where the plans considered have the same size), offer nearly optimal waste.
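The behavior of Random can be explored empirically. The sketch below draws one uniformly random permutation per processor and measures prefix overlaps, to be compared against the expected value $\ell_i \ell_j / t$; the seed parameter and function names are assumptions introduced for reproducibility of the illustration and are not part of the algorithm.

```python
import random


def random_schedules(n, t, seed=0):
    """One uniformly random permutation of the tasks 1..t per processor."""
    rng = random.Random(seed)
    tasks = list(range(1, t + 1))
    return [rng.sample(tasks, t) for _ in range(n)]


def prefix_overlap(schedule_i, length_i, schedule_j, length_j):
    """Number of tasks common to the first length_i tasks of one schedule
    and the first length_j tasks of another."""
    return len(set(schedule_i[:length_i]) & set(schedule_j[:length_j]))


def expected_overlap(length_i, length_j, t):
    """Expected overlap of two independent random prefixes."""
    return length_i * length_j / t
```

Averaging `prefix_overlap` over many pairs of schedules and comparing with `expected_overlap` gives a quick feel for how tightly the overlaps concentrate around their mean.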
The following part of the section is devoted to proving Theorem 9. The
analysis is divided into two sections, the first focusing on arbitrary pairs of
lengths and the second focusing specifically on “small” lengths $\ell \le \sqrt{t}$.

Behavior for arbitrary lengths. 

Consider two sets $A \subseteq [t]$ and $B \subseteq [t]$, A being selected at random among
all sets of size $d_A \sqrt{t}$ and B at random among all sets of size $d_B \sqrt{t}$. Then
$\mathrm{Exp}[|A \cap B|] = d_A d_B$. Judicious application of standard Chernoff bounds
coupled with an approximation argument yields the following theorem:

Theorem 10. Let A and B be ehosen randomly as above. Then there exists a 
eonstant c > 0 so that for all n and t < n, Pr [\A n B\> clAds + A{dAj ds)] < 
^ where A{dAj ds) = c ^log n (Vdjd^ + ^/log n ) . (The constant c is indepen- 
dent oft and n.) 

A proof of this fact can be found in a technical report [15]. Let c be the 
constant guaranteed by the above corollary and let be the (bad) event that 
^ djdj + zi(di,dj), where £i — diVi and £j — djVi. Let an 
event Bi be defined as disjunction £■ B'f'f^ . Considering that Pr < 

we have Pr [Bi] < x Pr \Bff^ < Hence 

Pr ki(M) n 'Kj{[£j])\ < didj + A{di,dj)\ > 1 - ^ 

We now concentrate on the behavior of these schedules for lengths $\ell \le \sqrt{t}$.

Behavior for short lengths. 

Observe that for any pair (i.J) of schedules, Exp [|7Ti([v^]) fl ^i([^])|] = 1- We 
would like to see such behavior for each pair (ijj). Let an event B 2 be defined 
as |7Ti([v^]) n7Tj([v^])| > Cologn. From the previous argument, there is a 
constant Cq so that Pr [B 2 ] < Considering that the expected value of this 
intersection is 1, we would like to insure some degree of palatable collective 
behavior: specifically, we would like to see that few of these overlaps are actually 
larger than a constant, say. To this end, let lij = |7Ti([v^]) D 7Tj([v^])|, and 

observe that Exp hj — ( 2 )* We may write lij — where 

is the indicator variable for the event 7Ti(m) € 7Tj{[Vi]). Observe that these 
variables are negatively correlated (i.e., Cov [A^^, A^^/j < 0 for each pair) so 
that Var [Tij] < J^L=i [^m] < E1 =i ^xp [X„] < Exp [liA . (Recall that for 
any indicator variable A^, Var[A^] < Exp[A^j.) Observe now that the variables 

lij are pairwise independent so that Var = X^-^Var[/ij] < ( 2 ), 

and an application of Chebyshev’s inequality to the quantity yields 




Collecting the pieces yields Theorem 9 above, since Pr [Bi] < X ^nd Pr [B 2 ] < 
Pr [Bi V B 2 ] < as desired. 
Abstract. We show that Naming, the existence of distinct IDs known
to all, is a necessary assumption of Herlihy's universality result for Consensus.
We then show in a very precise sense that Naming is harder than
Consensus and bring to the surface some important differences existing
between popular shared memory models which usually remain unnoticed.



1 Introduction 

The consensus problem enjoys a well-deserved reputation in the (theoretical) 
distributed computing community. Among others, a seminal paper of Herlihy 
added further evidence in support of the claim that consensus is indeed a key 
theoretical construct [12]. Herlihy's paper considers the following problem: Suppose
that, besides a shared memory, the hardware of our asynchronous, parallel
machine is equipped with objects (instantiations) of certain abstract data types
$T_1, T_2, \ldots$; given this, is it possible to implement objects of a new abstract
data type $Y$ in a wait-free manner? This question is the starting point of an
interesting theory leading to many results and further intriguing questions (see
[12,14] among others). Roughly stated, one of the basic results of this theory,
already contained in the original article of Herlihy, is this: If an abstract data
type $X$, together with a shared memory, is powerful enough to implement consensus
for $n$ processes in a wait-free manner then $X$, together with a shared
memory, is also powerful enough to implement in a wait-free manner, for $n$ processes,
any other data structure $Y$. This is Herlihy's celebrated universality result
for consensus.

In this paper we perform an analysis of some of the basic assumptions underlying
Herlihy's result and discover several interesting facts which, in view
of the above, are somewhat counter-intuitive and that could provocatively
be summarized by the slogans “consensus without naming is not universal” and
“naming with randomization is universal.” To state our results precisely we shall
recall some definitions and known results.

The naming problem is as follows: Devise a protocol for a set of $n$ processes
such that, at the end, each non-faulty process has selected a unique identifier
(key). If processes have identifiers to start with then we have the renaming
problem. Besides time and space, the size of the name space, the set of
possible identifiers, is also considered to be a resource whose consumption is to be
minimized.

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 134-148, 2000.
(c) Springer-Verlag Berlin Heidelberg 2000

We shall concern ourselves with probabilistic protocols (every process in
the system, modelled as an i/o automaton, has access to its own source of unbiased
random bits) for systems consisting of asynchronous processes communicating
via a shared memory. Processes can suffer from crash failures. The
availability of objects of abstract data type consensus and naming is assumed.

The protocols we devise are wait-free (i.e., $(n-1)$-resilient) in spite of the
adversary, the “malicious” non-deterministic agent (algorithm) modeling the
environment. The adversary decides which, among the currently pending operations,
goes on next. Pessimistically one assumes that the adversary is actually
trying to force the protocol to work incorrectly and that the next scheduling
decision (which process moves next) can be based on the whole past history of
the protocol execution so far. This is the so-called adaptive or strong adversary.
In contrast, sometimes it is assumed that the adversary decides the entire
execution schedule beforehand. This is the so-called oblivious or weak adversary.

In the literature two shared-memory models are widespread. The first assumes
multiple reader - multiple writer registers, in which every location
of the shared memory can be written or read by any process. The other model
assumes multiple reader - single writer registers. Here, every register is
owned by some unique process, which is the only process that can write on that
register, while every process is allowed to read the contents of any register. If
the processes use a common index scheme for other processes' registers (an initial
consistent numbering among the processes, as it is called in [7,8,13]), then
optimal naming is trivial by having every process rank its own number among
the other values and choose that rank-number as its key. To make the problem
non-trivial, we assume that each process $p$ accesses the $n$ registers by means of a
permutation $\pi_p$. That is, register $\pi_p(i)$, $p$'s $i$th register, will always be the same
register, but, for $p \ne q$, $\pi_p(i)$ and $\pi_q(i)$ might very well differ. Besides making the
problem nontrivial, this models certain situations in large dynamically changing
systems where the consistency requirement is difficult or impossible to maintain
[17] or in cryptographical systems where this kind of consistency is to be
avoided. In both models reads and writes are atomic operations; in case of concurrent
access to the same register it is assumed that the adversary “complies
with” some non-deterministic, but fair, policy. In this paper we shall refer to the
first as the symmetric memory model and to the second as the asymmetric
memory model. It is worth pointing out that the first model is considered to be
more “powerful” than the second. As we shall see, this intuition is somewhat
misleading.

In this paper we show the following. Assume that each process has access
to its own private source of independent random bits. Then,

- Consensus is “easy”: Assuming that (a) the memory is symmetric, and
  (b) processors are identical i/o automata without identifiers, there exist
  wait-free, Las Vegas consensus protocols for $n$ processes, for any $n > 1$, which
  work correctly even assuming the strong adversary. In fact, we exhibit a protocol
  whose running time (per processor) is polynomial both in expectation
  and with high probability.

- Naming is “hard”: In contrast, if (a) the memory is symmetric, (b) processors
  are identical i/o automata without identifiers which have access to
  (c) consensus objects, then Las Vegas naming is impossible, even assuming
  the weak adversary. Note that Monte Carlo naming is trivial: it is enough
  that each process generate $O(\log n)$ many random bits and with probability
  $1 - o(1)$ no two of them will be identical.¹

- Naming + Symmetry = Asymmetry: If the memory is symmetric and
  processes have unique identifiers then the memory can be trivially made
  asymmetric. In this paper we show the other direction of the equivalence
  above, namely, if the memory is asymmetric then naming is possible even
  against the strong adversary. We exhibit a simple, modular protocol whose
  running time and space per process are polynomial in $n$, the number of
  processors, both in expectation and with high probability. The size of the
  name space is optimal, thereby improving upon a previous result of [19].
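
The remark in the second item, that Monte Carlo naming is trivial, can be checked numerically. The following is a minimal sketch and not one of the protocols of this paper: the function names and the choice of $3\lceil \log_2 n \rceil$ bits per process are illustrative assumptions. Each process draws that many private random bits and uses them as its name; a union bound over all pairs of processes bounds the probability that two names coincide.

```python
import math
import random

def name_bits(n, factor=3):
    """Number of random bits drawn per process; `factor` is an illustrative choice."""
    return factor * max(1, math.ceil(math.log2(n)))

def collision_bound(n, bits):
    """Union bound over all pairs: Pr[some two names coincide] <= C(n,2) / 2^bits."""
    return n * (n - 1) / 2 / 2 ** bits

def monte_carlo_names(n, rng, factor=3):
    """Each process independently picks a name from its own random bits."""
    bits = name_bits(n, factor)
    return [rng.getrandbits(bits) for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)
    n = 1024
    names = monte_carlo_names(n, rng)
    print("all names distinct:", len(set(names)) == n)
    print("collision probability bound:", collision_bound(n, name_bits(n)))
```

The names may still collide with small probability, which is exactly why this is only a Monte Carlo solution and says nothing about Las Vegas naming.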

These results show that in a very precise sense, somewhat surprisingly, naming
is harder, or perhaps more “fundamental”, than consensus. As the second
result shows, naming is impossible in a model richer than a model in which
consensus is possible. Therefore naming, or some other form of asymmetry, is
a necessary assumption in Herlihy's universality result. An inspection of Herlihy's
construction does show that the assumption that processes have unique
identifiers known to all, i.e. naming, plays a crucial role [12]. One might wonder
whether in his model (deterministic, symmetric memory) names can be created
from scratch. As we show in this paper, the answer is no. In fact, they cannot
even be generated if randomness (and consensus objects) are allowed. In view
of this it would be more precise to restate Herlihy's result as “consensus plus
naming is universal”. Notice that, as the third result in the above list shows, if
randomness is allowed then asymmetric memory is powerful enough for naming.
Thus, we have yet another indication that randomness increases the power of
distributed systems as far as fault-tolerance is concerned.

A fundamental problem to confront when dealing with parallel or distributed
computation is “symmetry breaking.” It is well-known that randomness
is quite helpful in breaking the symmetry and that identifiers are another
effective means to deal with the problem. Our results show that randomness
is enough for consensus, a supposedly universal construct, but not for naming.
Furthermore, they show that single-writer registers are inherently symmetry-breaking,
whereas consensus is not. As a byproduct of our analysis we show that
Herlihy's universality result does not apply to another, quite basic data type,
unless naming or some other form of asymmetry is assumed. The data type in
question, which we call selectWinner, selects a unique winner among a set of
$n$ invoking processes. This task is impossible with symmetric memory, even if
randomness and consensus are available. This result also shows in a different
way how the power of randomization to “break the symmetry” is limited.

¹ Recall that a Las Vegas protocol is always correct and that only the running time
is a random variable, while for a Monte Carlo protocol correctness too is a random
variable.

Interestingly, all randomized consensus algorithms in the literature
known to the authors assume either asymmetric memory or the existence of
identifiers (see, for instance, [1,3,4,11]). Our results indicate that these assumptions
are not necessary and that randomness is enough to create the “necessary
asymmetry” as far as consensus is concerned.

Recently Chandra developed a very fast algorithm for consensus (whose
asymptotic performance was subsequently improved by Aumann [9,6]). His algorithm
is much faster than the lower bound shown by Aspnes for consensus
[2]. How is this possible? Aspnes' result holds for the asymmetric model, while
Chandra uses several assumptions which, a priori, could be responsible for the
speed-up: the availability of multiple reader - multiple writer registers instead
of multiple reader - single writer registers, i.e. symmetric instead
of asymmetric memory; pre-existing identifiers; and the intermediate adversary.
This is a third kind of adversary, lying between the weak and the strong.
Its behaviour is adaptive, but it has limited access to the outcome of the coin
flips in that it can read the outcome of a coin flip only when this is read by some
process, and not when the bit is generated (see, among others, [9]).

At first it would seem that the naming assumption must be the least important,
perhaps even superfluous, for multiple reader - multiple writer
registers certainly can cause a speed-up, and assuming a weaker adversary can
circumvent impossibility results such as that of Aspnes. In fact, our results show
that, as far as Chandra's protocol is concerned, naming is necessary, for identifiers
cannot be generated from scratch in his model. Aumann showed that the
lower bound of Aspnes can be circumvented even assuming asymmetric memory,
i.e. single writer - multiple reader registers [6]. We leave it as an open
question whether the same holds without the naming assumption.

Our results also show that the widespread intuition that multiple reader
- multiple writer registers are more “powerful” than multiple reader -
single writer registers is somewhat misleading. The intuition is correct under
the (quite reasonable) assumption that processes have unique identifiers. It is
however worth pointing out that the intuition is wrong without this assumption, for naming can
be solved with multiple reader - single writer registers, but it is impossible
with multiple reader - multiple writer registers, even if randomness
is allowed. This highlights an important difference between these two models
which, we feel, is often overlooked, and shows that memory assumptions must
be carefully stated.

Our first result, a randomized, wait-free protocol for consensus in the symmetric
model (which can withstand the strong adversary), is obtained by combining
several known ideas and protocols, in particular those in [3] and [9]. When
compared to the protocol in [3] it is, we believe, simpler, and its correctness is
easier to establish (see [20]). Moreover, it works in the less powerful symmetric
model and can deal with the strong adversary, whereas the protocol in [9] works
only against the intermediate adversary. From the technical point of view, our
second result is essentially contained in [16] to which we refer for other interesting
related results. Related work can be found in [5,?].

In spite of the fact that we make use of several known technical ingredients, 
our analysis, we believe, is novel and brings to light for the first time new and, 
we hope, interesting aspects of fundamental concepts. 

2 Consensus is easy. Naming is hard 

We start by outlining a consensus protocol assuming that (a) the memory is sym- 
metric, (b) processes are i/o automata without identifiers which have access to 
their own source of (c) random bits. Our protocol is obtained by combining 
together several known ideas and by adapting them to our setting. The proto- 
col, a randomized implementation of n-process binary consensus for symmetric 
memory, is a modification of the protocol proposed by Chandra [9]. The orig- 
inal protocol cannot be used in our setting since its shared coins require that 
processes have unique IDs. Thus, we combine it with a modification of the weak 
shared coin protocol of Aspnes and Herlihy [3]. The latter cannot be directly 
used in our setting either, since it requires asymmetric memory. Another differ- 
ence is that, unlike in Chandra’s protocol, we cannot revert to Aspnes’ consensus 
[1]. In this paper we are only interested in establishing the existence of a polyno- 
mial protocol and make no attempt at optimization. Since the expected running 
time of our protocol is polynomial, by Markov’s Inequality, it follows that the 
running time and, consequently, the space used are polynomial with high prob- 
ability (inverse polynomial probability of failure). Conceivably superpolynomial 
space could be needed. We leave it as an open problem whether this is necessary. 
In the sequel we will assume familiarity with the notion of weak shared coin of 
[3] to which the reader is referred. 

The protocol, shown in Figure 1, is based on the following idea. Processes
engage in a race of sorts by splitting into two groups: those supporting the 0
value and those supporting the 1 value. At the beginning membership in the
two “teams” is decided by the input bits. Corresponding to each team there is a
“counter”, implemented with a row of contiguous “flags” (the array of booleans
Mark[ , ]) which are to be raised one after the other starting from the left by
the team members, cooperatively and asynchronously. The variable position_p of
each process records the rightmost (raised) flag of its team the process knows
about. The protocol keeps executing the following loop, until a decision is made.
The current team of a process is defined by the variable estimate_p. The process
first increments its own team counter by raising the position_p-th flag of its own
team (this might have already been done by some other team member, but
never mind). This means that, as far as the process is concerned, the value of
its own team counter is position_p (of course, this might not accurately reflect
the real situation). The process then “reads” the other counter by looking at
the other team's row of flags at positions position_p + 1, position_p, position_p - 1,
in this order. There are four cases to consider: (a) if the other team is ahead
the process sets the variable newEstimate_p to the other team; (b) if the two
counters are equal, the process flips a fair coin $X \in \{0,1\}$ by invoking the
protocol GetCoin_δ( , ) and sets newEstimate_p to $X$; (c) if the other team trails
by one, the process sticks to its team, and (d) if the other team trails by two
(or more) the process decides on its own team and stops executing the protocol.
Before executing the next iteration, the process checks again the counter of its
own team. If this has been changed in the meanwhile (i.e. if the (position_p + 1)-st
flag has been raised) then the process sticks to its old team and continues;
otherwise, it joins the team specified by newEstimate_p (which in case of a
random coin flip can still be the old team). The array Mark[ , ] is implemented
with multiple reader - multiple writer registers, while the other variables
are local to each process and accessible to it only.
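
The race described above can be exercised in a small simulation. The sketch below is illustrative only and rests on two loudly stated assumptions: the weak shared coin GetCoin_δ is replaced by an independent local fair coin (`local_coin`, a hypothetical stand-in), and the adversary is replaced by a uniformly random scheduler. Agreement and validity do not depend on the coin, so they can still be observed; the running-time guarantees of the paper are not reproduced. Each process is a Python generator, and every `yield` marks the end of one atomic access to the shared array.

```python
import random
from collections import defaultdict

def local_coin(rng):
    # Stand-in for GetCoin_delta: an independent fair coin (an assumption of this sketch).
    return rng.randrange(2)

def propose(v, mark, rng):
    """One process of the flag-raising race; each `yield` ends one atomic shared-memory step."""
    estimate, position = v, 1
    while True:
        mark[(estimate, position)] = True                  # raise own team's flag
        yield
        ahead = mark[(1 - estimate, position + 1)]         # is the other team ahead?
        yield
        if ahead:
            new_estimate = 1 - estimate
        else:
            level = mark[(1 - estimate, position)]         # is the other team level?
            yield
            if level:
                new_estimate = local_coin(rng)
            else:
                behind = mark[(1 - estimate, position - 1)]  # does it trail by exactly one?
                yield
                if not behind:
                    return estimate                        # trails by two or more: decide
                new_estimate = estimate
        own_next = mark[(estimate, position + 1)]          # has own team already moved on?
        yield
        if not own_next:
            estimate = new_estimate
        position += 1

def run(inputs, seed, max_steps=100_000):
    """Interleave the processes one atomic step at a time under a random scheduler."""
    rng = random.Random(seed)
    mark = defaultdict(bool)
    mark[(0, 0)] = mark[(1, 0)] = True                     # initialization
    alive = {p: propose(v, mark, rng) for p, v in enumerate(inputs)}
    decisions = {}
    for _ in range(max_steps):
        if not alive:
            break
        p = rng.choice(sorted(alive))
        try:
            next(alive[p])
        except StopIteration as stop:
            decisions[p] = stop.value                      # value returned by propose
            del alive[p]
    return decisions

if __name__ == "__main__":
    print(run([0, 1, 0, 1], seed=7))
```

Calling `run` with mixed inputs returns one decision per process; all decisions coincide and equal one of the inputs.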

A crucial difference between our protocol and that of Chandra concerns procedure
GetCoin_δ( , ). In Chandra's setting essentially it is possible to implement
“via software” a global coin, thanks to the naming assumption. In the implementation
in Figure 1, we use a protocol for a weak shared coin for symmetric
memory. For every $b \in \{0,1\}$ and every $i \ge 1$ an independent realization of
the weak shared coin protocol is performed. An invocation of such a protocol
is denoted by GetCoin_δ(b, i), where δ is a positive real that represents the
agreement parameter of the weak shared coin (see [3]).

First, we prove that the protocol in Figure 1 is correct and efficient. Later 
we show how to implement the weak shared coin. 

Lemma 1. If some process decides v at time t, then, before time t, some process
started executing propose(v).

Proof. The proof is exactly the same as that of Lemma 1 in [9].

Lemma 2. No two processes decide different values.

Proof. The proof is exactly the same as that of case (3) of Lemma 4 in [9].

Lemma 3. Suppose that the following conditions hold:

i) Mark[b, i] = true at time t,
ii) Mark[1 - b, i] = false before time t,
iii) Mark[1 - b, i] is set true at time t' (t' > t), and
iv) every invocation of both GetCoin_δ(b, i) and GetCoin_δ(1 - b, i) yields
value b.

Then, no process sets Mark[1 - b, i + 1] to true.

Proof. The proof is essentially the same as that of the Claim included in the
proof of Lemma 6 in [9].

{Initialization}
Mark[0, 0], Mark[1, 0] ← true

{Algorithm for process p}
function propose(v): returns 0 or 1
    estimate_p ← v
    position_p ← 1
    repeat
        Mark[estimate_p, position_p] ← true
        if Mark[1 - estimate_p, position_p + 1]
            newEstimate_p ← 1 - estimate_p
        else if Mark[1 - estimate_p, position_p]
            newEstimate_p ← GetCoin_δ(estimate_p, position_p)
        else if Mark[1 - estimate_p, position_p - 1]
            newEstimate_p ← estimate_p
        else return(estimate_p)    {Decide estimate_p}
        if not Mark[estimate_p, position_p + 1]
            estimate_p ← newEstimate_p
        position_p ← position_p + 1
    end repeat

Fig. 1. n-process binary consensus for symmetric memory

The next lemma is the heart of the new proof. The difficulty of course is that
now we are using protocol GetCoin_δ( , ) instead of the “global coins” of [9],
and have to contend with the strong adversary. The crucial observation is that if
two teams are in the same position $i$ and the adversary wants to preserve parity
between them, it must allow both teams to raise their flags “simultaneously,” i.e.
at least one teammate in each team must observe parity in the row of flags. But
then each team will proceed to invoke GetCoin_δ( , ), whose unknown outcome
is unfavourable to the adversary with probability at least $(\delta/2)^2$.

Lemma 4. If Mark[b, i] = true at time t and Mark[1 - b, i] = false before
time t, then with probability at least $\delta^2/4$, Mark[1 - b, i + 1] is always false.

Proof. If Mark[1 - b, i] is always false, then it can be shown that Mark[1 - b, i + 1]
is always false (the proof is the same as that of Lemma 2 in [9]). So, assume that
Mark[1 - b, i] is set to true at some time t' (clearly, t' > t). Since no invocation
of either GetCoin_δ(b, i) or GetCoin_δ(1 - b, i) is made before time t, the
values yielded by these invocations are independent of the schedule until time t.
Thus, with probability at least $\delta^2/4$, all the invocations of GetCoin_δ(b, i) and
GetCoin_δ(1 - b, i) yield the same value b. From Lemma 3, it follows that, with
probability at least $\delta^2/4$, Mark[1 - b, i + 1] is always false.

Theorem 1. The protocol of Figure 1 is a randomized solution to n-process
binary consensus. Assuming that each invocation of GetCoin_δ( , ) costs one
unit of time, the expected running time per process is $O(1)$. Furthermore, with high
probability every process will invoke GetCoin_δ( , ) $O(\log n)$ many times.

Proof. (Sketch) From Lemma 2 the protocol is consistent and from Lemma 1
it is also valid. Thus, the protocol is correct.

As regards the expected decision time for any process, let $P(i)$ denote the
probability that there is a value $b \in \{0, 1\}$ such that Mark[b, i] is always false.
From Lemma 4, it follows that

$$P(i) \ge 1 - (1 - \delta^2/4)^{i-1}, \qquad i \ge 1.$$

Also, if Mark[b, i] is always false, it is easy to see that all the processes decide
within $i + 1$ iterations of the repeat loop. Thus, with probability at least
$1 - (1 - \delta^2/4)^{i-1}$, all the processes decide within $i + 1$ iterations of the repeat loop. This
implies that the expected running time per process is $O(1)$. The high probability
claim follows from the observation that pessimistically the process describing the
invocations of GetCoin_δ( , ) can be modelled as a geometric distribution with
parameter $p := (\delta/2)^2$.
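
The step from the tail bound to the two claims of Theorem 1 can be spelled out. As a short worked calculation, write $q = \delta^2/4$ and let $T$ be the number of iterations of the repeat loop executed by a process (the symbols $q$, $T$ and the constant $a$ are introduced here only for illustration):

```latex
\Pr[T > i + 1] \;\le\; 1 - P(i) \;\le\; (1 - q)^{i-1}, \qquad i \ge 1,

\mathrm{Exp}[T] \;=\; \sum_{j \ge 0} \Pr[T > j]
  \;\le\; 2 + \sum_{i \ge 1} (1 - q)^{i-1}
  \;=\; 2 + \frac{1}{q} \;=\; 2 + \frac{4}{\delta^2},

\Pr\!\left[T > 2 + \tfrac{a}{q}\ln n\right] \;\le\; (1 - q)^{(a/q)\ln n}
  \;\le\; e^{-a \ln n} \;=\; n^{-a}.
```

For a constant agreement parameter $\delta$ the first bound is $O(1)$, and the last line gives $O(\log n)$ iterations, hence $O(\log n)$ coin invocations, with probability at least $1 - n^{-a}$ for one process (a union bound over the $n$ processes costs one factor of $n$).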

We now come to the implementation of the weak shared coin for symmetric 
memory, which we accomplish via a slight modification of the protocol of Aspnes 
and Herlihy [3]. In that protocol the n processes cooperatively simulate a ran- 
dom walk with absorbing barriers. To keep track of the pebble a distributed 
counter is employed. The distributed counter is implemented with an array of 
n registers, with position i privately owned by process i (that is, naming or 
asymmetric memory is assumed). When process i wants to move the pebble it 
updates atomically its own private register by incrementing or decrementing it 
by one. The private register also records another piece of information namely, 
the number of times that the owner updated it (this allows one to show that 
the implementation of the read is linearizable). On the other hand, reading the 
position of the pebble is a non-atomic operation. To read the counter the pro- 
cess scans the array of registers twice; if the two scans yield identical values 
the read is completed, otherwise two more scans are performed, and so on. As 
shown in [3], the expected number of elementary operations (read’s and write’s) 
performed by each process is 0(n*). 

Since in our setting we cannot use single-writer registers, we use an array C[]
of n* multiple- writer multiple-reader registers for the counter. The algorithm for 
a process p is as follows. Firstly, p chooses uniformly at random one of the n* 
registers of C[], let it be the k-th. Then, the process proceeds with the protocol of
Aspnes and Herlihy by using C[k] as its own register and by applying the counting
operations to all the registers of C[]. Since we are using more registers than the
original $n$, the expected number of steps that each process performs to simulate the
protocol is 0(n®). The agreement parameter of the protocol is set to 2eS. Since 
the expected number of rounds of the original protocol is 0(n''), by Markov’s 
Inequality, there is a constant B such that, with probability at least 1/2, the 
protocol terminates within Bn® rounds. It is easy to see that if no two processes 
choose the same register, then the protocol implements a weak shared coin with 
the same agreement parameter of the original protocol in 0(n®) many steps. To 
ensure that our protocol will terminate in any case, if after Bn® steps the process 
has not yet decided then it flips a coin and decides accordingly. Thus, even in
case of collision the protocol terminates with a value 0 or 1 within 0(n®) steps. 
The probability that no two processes choose the same register is at least $1/e$.
Thus, the agreement parameter of our protocol is at least $1/2 \cdot 1/e \cdot 2e\delta = \delta$. We
have proved the following fact.

Proposition 1, For any 6 > 0, a weak shared coin with agreement parameter 
S can be implemented in the symmetric model (with randomization) in 0{n^) 
steps, even against the strong adversary. 

Therefore the expected running time per process of the protocol of Theorem 1 is
the same 0{n^). 
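
The role of the $1/e$ factor in the agreement parameter can be checked numerically. The sketch below is illustrative: it takes the number of registers `m` as a parameter, and the choice `m = n * n` used in the check is an assumption made here for illustration, as are the function names. It computes the exact probability that $n$ independent uniform choices among `m` registers are pairwise distinct and combines it with the factors $1/2$ and $2e\delta$ from the argument above.

```python
import math

def no_collision_probability(n, m):
    """Exact Pr[n independent uniform choices among m registers are all distinct]."""
    prob = 1.0
    for i in range(n):
        prob *= (m - i) / m
    return prob

def agreement_lower_bound(n, m, delta):
    """1/2 (termination within the step budget) * Pr[no collision] * 2e*delta."""
    return 0.5 * no_collision_probability(n, m) * 2 * math.e * delta

if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        print(n, no_collision_probability(n, n * n), agreement_lower_bound(n, n * n, 0.1))
```

With `m = n * n` the no-collision probability stays above $1/e$ for every `n`, so the resulting lower bound on the agreement parameter is at least `delta`.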

In contrast, no protocol exists in the symmetric model for Naming, even 
assuming the availability of consensus objects and the weak adversary. 

Proposition 2. Suppose that an asynchronous, shared memory machine is such
that:

- the memory is symmetric;
- every process has access to a source of independent, unbiased random bits;
  and
- consensus objects are available.

Then, still, Naming is impossible even against a weak adversary.

Proof. (Sketch) By contradiction suppose there exists such a protocol. Consider
two processes P and Q and let only Q go. Since the protocol is wait-free there
exists a sequence of steps $\sigma = s_1 s_2 \ldots s_n$ taken by Q such that Q decides on
a name $k_\sigma$. The memory goes through a sequence of states $m_0 m_1 \ldots m_n$. The
sequence $\sigma$ has a certain probability $p_\sigma = p_1 p_2 \cdots p_n$ of being executed by Q.
Start the system again, this time making both P and Q move, but one step at a
time alternating between P and Q. With probability $p_1^2$ both P and Q will make
the same step $s_1$. A simple case analysis performed on the atomic operations
(read, write, invoke consensus) shows that thereafter P and Q are in the same
state and the shared memory is in the same state $m_1$ in which it was when Q
executed $s_1$ alone. This happens with probability $p_1^2$. With probability $p_2^2$, if P
and Q make one more step each, we reach a situation in which P and Q are in
the same state and the memory state is $m_2$. And so on, until, with probability
$p_\sigma^2 > 0$, both P and Q decide on the same identifier, a contradiction.
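
The lockstep argument can be watched on a toy example. The sketch below is illustrative only: `toy_naming` is a hypothetical anonymous protocol invented for this example (it is not a protocol from the paper), and `run_lockstep` models the event of the proof in which identical processes, scheduled round-robin on symmetric memory, happen to draw the same random bits. Processes are generators that yield shared-memory operations and receive the result of each read.

```python
def toy_naming(bits):
    """A hypothetical anonymous protocol: yields memory operations, receives read results."""
    while True:
        name = 2 * next(bits) + next(bits)       # draw a 2-bit candidate name
        seen = yield ("read", name)
        if seen is None:
            yield ("write", name, "taken")
            return name                          # decide on the candidate
        # register already taken: draw fresh bits and retry

def run_lockstep(protocol, bit_string, copies):
    """Run `copies` identical processes round-robin, all fed the same random bits."""
    memory = {}
    procs = [protocol(iter(bit_string)) for _ in range(copies)]
    ops = [next(p) for p in procs]               # first pending operation of each process
    names = [None] * copies
    while any(op is not None for op in ops):
        for idx, proc in enumerate(procs):
            op = ops[idx]
            if op is None:
                continue
            if op[0] == "read":
                result = memory.get(op[1])
            else:                                # ("write", register, value)
                memory[op[1]] = op[2]
                result = None
            try:
                ops[idx] = proc.send(result)
            except StopIteration as stop:
                names[idx], ops[idx] = stop.value, None
    return names

if __name__ == "__main__":
    print(run_lockstep(toy_naming, [1, 0, 1, 1], copies=1))
    print(run_lockstep(toy_naming, [1, 0, 1, 1], copies=2))
```

Run alone, the process decides on a name; run in lockstep with a copy of itself on the same bits, both copies pass through the same states and decide on that same name, which is the contradiction used in the proof.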

Thus, naming is “harder” than consensus. The next fact shows that naming
is a necessary assumption in Herlihy's universality construction. The proof is
omitted from this extended abstract.

Proposition 3. If processes are identical, deterministic i/o automata without
identifiers then Naming is impossible, even if memory is asymmetric and the
adversary is weak.

3 Randomization (Naming + Symmetry = Asymmetry)

In this section we exhibit a protocol to show the non-trivial part of the above
“equation”, namely, that asymmetric memory is enough for naming, provided
randomization is available.

Let us start by informally describing protocol squeeze whose task is to assign
a unique identifier to each one out of a set of $n$ processes, using a name space of
size $n$. For now, let us assume the availability of objects selectWinner(i) with
the following semantics. The object can be invoked by a process $p$ with a parameter
$i$; the object's response is to return the value “You own key i!” to exactly
one of the invoking processes, and “Sorry, look for another key” to all remaining
processes. The choice of the “winner” is non-deterministic. Later we will show
that selectWinner can be implemented efficiently in a wait-free manner in our
setting. With selectWinner a naming protocol can be easily obtained as follows:
Try each key one by one, in sequence, each time invoking selectWinner. This is
shown in Figure 2. This protocol always works but linearly many processes always
perform linearly many invocations of selectWinner, something which could be
quite expensive. Thus, we turn our attention to protocol squeeze which, with
high probability, will only perform $O(\log^2 n)$ such invocations.

Let us then turn our attention to protocol squeeze. The name space is divided
into segments, defined by the following recurrence, where $p$ is a parameter
to be fixed later:

$$s_i := p(1-p)^{i-1} n.$$

$s_\ell$ is the last value $s_i$ such that $s_i > c \log^2 n$ ($c$ a parameter to be set by the
user; the bigger the $c$ the higher the probability that selectWinner will be invoked
only $O(\log^2 n)$ many times). The first segment consists of the key interval
$I_0 := [0, s_1)$; the second segment consists of the key interval $I_1 := [s_1, s_1 + s_2)$;
the third of the key interval $I_2 := [s_1 + s_2, s_1 + s_2 + s_3)$, and so on. The final
segment $I_\ell$ consists of the last $n - \sum_{j=1}^{\ell} s_j$ keys. In the protocol, each process
$p$ starts by selecting a tentative key $i$ uniformly at random in $I_0$. Then, it
invokes selectWinner(i); if $p$ “wins,” the key becomes final and $p$ stops; otherwise,
$p$ selects a second tentative key $j$ uniformly at random in $I_1$. Again,
selectWinner(j) is invoked and if $p$ “wins” $j$ becomes final and $p$ stops, otherwise
$p$ continues in this fashion until $I_\ell$ is reached. The keys of $I_\ell$ are tried one
by one in sequence. If at the end $p$ has no key yet, it will execute the protocol
easyButExpensive of Figure 2 as a back-up procedure. The resulting protocol
appears in Figure 3. We will show that: (a) with high probability every process
receives a unique key before the back-up procedure, and (b) with probability 1
every process receives a key by the end of the back-up procedure.
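
The behaviour of squeeze without crashes can be explored with a sequential simulation. The sketch below is a simplification under stated assumptions, not the protocol of Figure 3: segment sizes are taken to be $\lfloor p(1-p)^{i-1} n \rfloor$ with $p = 1/(k \ln n)$, selectWinner is modelled by a dictionary in which the first invoker of a key wins, all processes perform their $i$-th attempt in the same round, and the names `segments`, `squeeze`, `k` and `c` are illustrative. It requires $n \ge 2$.

```python
import math
import random

def segments(n, k=3.0, c=1.0):
    """Key intervals [lo, hi): geometrically shrinking segments, then one final segment."""
    p = 1.0 / (k * math.log(n))
    threshold = c * math.log(n) ** 2
    bounds, start, i = [], 0, 1
    while True:
        size = int(p * (1 - p) ** (i - 1) * n)
        if size <= threshold or start + size >= n:
            break
        bounds.append((start, start + size))
        start += size
        i += 1
    bounds.append((start, n))                    # final segment: all remaining keys
    return bounds

def squeeze(n, rng, k=3.0, c=1.0):
    """Return (owner, used_backup): key -> process map and number of back-up executions."""
    segs = segments(n, k, c)
    owner = {}                                   # stands in for the selectWinner objects
    pending = list(range(n))
    for lo, hi in segs[:-1]:
        rng.shuffle(pending)                     # the winner among invokers is arbitrary
        losers = []
        for proc in pending:
            key = rng.randrange(lo, hi)          # tentative key, uniform in the segment
            if key in owner:
                losers.append(proc)
            else:
                owner[key] = proc
        pending = losers
    lo, _ = segs[-1]
    # The final segment is tried key by key; leftover processes fall back to a scan of all keys.
    free = list(range(lo, n)) + [key for key in range(lo) if key not in owner]
    used_backup = max(0, len(pending) - (n - lo))
    for proc, key in zip(pending, free):
        owner[key] = proc
    return owner, used_backup

if __name__ == "__main__":
    owner, used_backup = squeeze(10_000, random.Random(1), c=0.5)
    print(len(segments(10_000, c=0.5)), "segments;", used_backup, "processes used the back-up")
```

Every process ends with a distinct key in all runs; the back-up scan is needed only when some key of an earlier segment was never claimed.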
The intuition behind squeeze is this. On the one hand, if the segments are
small enough then, with high probability, for every $i$, selectWinner(i) will be
invoked by some process and, consequently, every key $i$ will be assigned (all
processes currently looking for a key are “squeezed” in an interval). In turn,
this implies that no $p$ will execute the backup procedure. On the other hand,
if the segments are big enough, the number of intervals $I_i$, corresponding to
the number of attempts before the backup procedure, will be small. Let us now
quantify this intuition. Let $p_i$ be defined by the following recurrence

$$p_i := (1-p)^{i-1} n.$$

If there are no crashes, the number of processes which perform the $i$-th attempt,
i.e. those processes which select a tentative key in $I_i$, is obviously at least $p_i$.
We want to show that, with high probability, it is at most $p_i$, for $i < \ell$. If we
can show this we are done because then $p_\ell = s_\ell$ and the protocol ensures that
every one of the remaining $p_\ell$ processes will receive one of the last $s_\ell$ keys. A key
$k$ is claimed if selectWinner(k) is invoked by some process. Suppose first that
there are no crashes. Then,

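$$\Pr[\exists k \in I_i,\ k \text{ not claimed}] \le s_i \left(1 - \frac{1}{s_i}\right)^{p_i} \le s_i \exp\left(-\frac{p_i}{s_i}\right) = s_i \exp\left(-\frac{1}{p}\right) \le n^{-k+1}$$

for any fixed $k > 0$, provided that

$$p := \frac{1}{k \log n}.$$

With this value of $p$ the number of segments, i.e. $\ell$, is $O(\log^2 n)$. The expected
running time is therefore

$$T(n) = O(\log^2 n)(1 - n^{-k+1}) + O(n)\, n^{-k+1} = O(\log^2 n)$$

for $k > 3$. It can be shown that the running time is of the same order as the
expectation with high probability.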
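
The bound on the number of segments can be derived in a few lines. The calculation below is a sketch under the assumption that the segment sizes are $s_i = p(1-p)^{i-1} n$ with $p = 1/(k \ln n)$, that segments are kept as long as $s_i \ge c\log^2 n$, and that $n$ is large enough for $c \log^2 n \ge p$ to hold:

```latex
s_i = p(1-p)^{i-1} n \ge c\log^2 n
\;\Longrightarrow\;
(i-1)\,\ln\frac{1}{1-p} \;\le\; \ln\frac{p\,n}{c\log^2 n} \;\le\; \ln n,

\ln\frac{1}{1-p} \ge p
\;\Longrightarrow\;
\ell - 1 \;\le\; \frac{\ln n}{p} \;=\; k\,\ln^2 n \;=\; O(\log^2 n).
```

Each process performs at most one invocation of selectWinner per segment before the final one, which is where the $O(\log^2 n)$ term in $T(n)$ comes from.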

We now give an informal argument showing that when there are crashes the
situation can only improve, deferring a formal justification to the full paper. A
run of the protocol can be modelled in an equivalent fashion as follows. We have
n bins, one for each key, and n white balls, one for each process. As for the
key space, the n bins are organized into segments I_i. A process selects a key at
random from I_i by throwing a ball into the corresponding bin interval. A key k is
claimed if bin k is hit by at least one ball. A run without crashes corresponds to
the following process. The n balls are thrown independently at random into the
first segment I_1; for each bin which is hit, a ball is selected and discarded. The
remaining balls are thrown independently at random into I_2, and so on. Let us
denote by p_i the number of bins from I_i which are hit. A run with crashes is
modelled as follows. When the adversary crashes process p, the corresponding ball
is painted red. In our balls-and-bins experiment we keep throwing all balls, red
and white, but when a bin is hit we always discard a white ball. If we denote by
w_i the number of processes which perform the i-th attempt and by r_i the number
of processes crashed before their i-th attempt, then p_i ≤ r_i + w_i. This is because
for each ball which is discarded in a run without crashes we can always find a
corresponding (unique) ball in the other run which corresponds to a crashed
process or to a process that obtained a key. Therefore, with high probability, at
the end each process has either crashed or received a unique key.
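
The crash-free balls-and-bins experiment can be simulated directly. The sketch below is only illustrative: the function names, the seed, and the segment sizes passed in are assumptions, not values from the paper.

```python
import random

def throw_round(balls, bins, rng):
    """Throw `balls` balls uniformly at random into `bins` bins.

    Every bin that is hit discards exactly one ball (one key is claimed).
    Returns (bins hit, balls left for the next segment)."""
    hit = {rng.randrange(bins) for _ in range(balls)}
    return len(hit), balls - len(hit)

def run_without_crashes(n, segment_sizes, seed=0):
    """Run the experiment over consecutive segments; sizes are hypothetical."""
    rng = random.Random(seed)
    remaining = n
    history = []  # (segment size, bins hit, balls remaining)
    for size in segment_sizes:
        if remaining == 0:
            break
        hit, remaining = throw_round(remaining, size, rng)
        history.append((size, hit, remaining))
    return history

if __name__ == "__main__":
    for size, hit, remaining in run_without_crashes(1000, [100, 90, 81]):
        print(f"segment of {size} bins: {hit} hit, {remaining} balls remain")
```

When a segment is much smaller than the number of balls thrown into it, almost every bin is hit, which is the "squeezing" effect the argument relies on.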

We now show how to implement selectWinner in a wait-free manner in polynomial
time. We will assume the availability of objects of type consensus(i, b)
where 1 ≤ i ≤ n and b ∈ {0, 1}. Each invoking process p will perform the
invocation using the two parameters i and b; the object response will be the same
to all processes and will be “The consensus value for i is v” where v is one of
the bits b which were proposed. This can be assumed without loss of generality
since consensus can be implemented in a wait-free manner in the asymmetric
model. Using consensus objects will simplify the presentation. The protocol for
selectWinner, shown in Figure 4, is as follows. Each process p generates a bit
b_1^p at random and invokes consensus(1, b_1^p). Let v_1 be the response of the
consensus object. If b_1^p ≠ v_1 then p is a loser and exits the protocol. Otherwise, p is
still in the game. Now the problem is to ascertain whether p is alone, in which
case it is the winner, or if there are other processes still in the game. To this end,
each remaining process scans the array W[1, i], for 1 ≤ i ≤ n, which is initialized
to all 0’s. If W[1, ·] contains a 1 then p declares itself a loser and exits; otherwise
it writes a 1 in its private position W[1, p] and scans W[1, ·] again. If W[1, ·]
contains a single 1, namely W[1, p], then p declares itself the winner and grabs
the key; otherwise it continues the game, that is, it generates a second bit
b_2^p at random, invokes consensus(2, b_2^p), and so on. The following facts establish the
correctness of the protocol. Their proof is omitted from this extended abstract.

Proposition 4. If p declares itself the winner then it is the only process to do
so.

Proposition 5. There is always a process which declares itself the winner.
Moreover, with probability 1 - o(1) every process p generates O(log n) many random
bits b_i^p.

These facts establish the correctness of the protocol and the high-probability
bound on the running time, since consensus can be implemented in the
asymmetric model in polynomial time.

Remark 1: Protocol squeeze is also a good renaming protocol. Instead of the
random bits, each process can use the bits of its own ID starting, say, from the
left hand side. Since the IDs are all different the above scheme will always select
a unique winner within O(|ID|) invocations of consensus.
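
The idea in Remark 1 can be illustrated with a small sequential sketch. It is a simplification under stated assumptions: all contenders move in lock-step rounds, the consensus object is replaced by a hypothetical `decide` function that returns one of the proposed bits, and the scan of the array W is omitted.

```python
def select_by_id_bits(ids, width, decide=min):
    """Narrow a set of distinct IDs down to one winner, one bit per round.

    ids    -- distinct non-negative integers, each below 2**width
    decide -- stand-in for consensus: must return one of the proposed bits
    Returns (winner id, number of rounds used)."""
    alive = list(ids)
    rounds = 0
    for pos in range(width - 1, -1, -1):      # leftmost bit first
        if len(alive) == 1:
            break
        proposals = [(pid >> pos) & 1 for pid in alive]
        v = decide(proposals)                 # agreed value for this round
        # processes whose bit differs from the agreed value drop out
        alive = [pid for pid, b in zip(alive, proposals) if b == v]
        rounds += 1
    return alive[0], rounds

if __name__ == "__main__":
    print(select_by_id_bits([5, 3, 6, 2], width=3))
```

Because the agreed bit is always one that some surviving process proposed, at least one process stays in the game each round, and distinct IDs must differ in some bit, so a single survivor remains after at most `width` rounds.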
protocol simpleButExpensive(): key;
begin
  for k := 1 to n do
    if selectWinner(k) = "You own key k!" then return(k);
end

Fig. 2. Simple but expensive protocol for Naming



protocol squeeze(): key;
begin
  for i := 1 to ℓ do begin
    k := random key in interval I_i;
    if selectWinner(k) = "You own key k!" then return(k);
  end;
  for k := n - s_ℓ to n do  {try key in I_ℓ one by one}
    if selectWinner(k) = "You own key k!" then return(k);
  return(simpleButExpensive())  {back up procedure}
end

Fig. 3. Protocol squeeze



protocol selectWinner(i: key);
  myBit := "private bit of executing process";
  attempt := 1;
  repeat
    b := random bit;
    if (b = consensus(i, b)) then begin
      scan W[attempt, j] for 1 ≤ j ≤ n;
      if (W[attempt, j] = 0, for all j) then begin
        W[attempt, myBit] := 1;
        scan W[attempt, j] for 1 ≤ j ≤ n;
        if (W[attempt, j] = 0, for all j <> myBit) then return(i);  {key is grabbed!}
        else attempt := attempt + 1;  {keep trying}
    else return("Sorry, look for another key.");
  end repeat

Fig. 4. Protocol selectWinner
Remark 2: The only part of protocol squeeze that actually uses the memory
is protocol selectWinner. In view of Fact 2 this task must be impossible with
symmetric memory, even if randomness and consensus are available. Thus, this
is another task for which, strictly speaking, Herlihy’s result does not hold and
it is another example of something that cannot be accomplished by the power
of randomization alone.
Polynomial and Adaptive Long-lived
(2k - 1)-Renaming*

(Extended Abstract) 
Abstract. In the long-lived M-renaming problem, processes repeatedly
obtain and release new names taken from a domain of size M. This
paper presents the first polynomial algorithm for long-lived
(2k - 1)-renaming. The algorithm is adaptive as its step complexity is O(k^4); here
k is the point contention, that is, the maximal number of simultaneously
active processes at some point of the execution. Polynomial step complexity is
achieved by having processes help each other to obtain new names, while
adaptiveness is achieved by a novel application of sieves.



1 Introduction 

Distributed coordination algorithms are designed to accommodate a large
number of processes, each with a distinct identifier. Often, only a few processes
simultaneously participate in the coordination algorithm [19]. In this case, it is
worthwhile to rename the participating processes [6, 21]: Before starting the
coordination algorithm, a process uses getName to obtain a unique new name, a
positive integer in the range {1, ..., M}; the process then performs the
coordination algorithm, using the new name instead of its identifier; when the
coordination algorithm completes, releaseName allows the name to be re-used later.

A renaming algorithm guarantees adaptive name space if M is a function
of k, the maximal number of processes simultaneously obtaining new names;
k is called the point contention. Obviously, M should be as small as possible,
preferably linear in k; it is known that M ≥ 2k - 1 [15, 18]. For the renaming
stage to be useful it must also have adaptive step complexity: The number of steps
a process takes in order to obtain a new name (or to release it) is a function of k.
Under this definition, getName is delayed only when many processes participate
simultaneously. (Precise definitions appear in Section 2.)

This paper presents an adaptive algorithm for long-lived renaming, using
read and write operations. In our algorithm, a process obtains a name in the
range {1, ..., 2k - 1} with O(k^4) steps. Thus, the algorithm's step complexity
is a function of the maximal number of processes simultaneously active at some
point during the getName operation. This is the first long-lived renaming
algorithm with optimal name space whose step complexity is polynomial in the point
contention. All previous long-lived renaming algorithms providing linear name
space have exponential step complexity [16].

* Due to space limitations, many details are omitted from this extended abstract; a
full version of the paper is available through www.cs.technion.ac.il/~hagit/pubs.html.

The algorithm uses a building block called sieve [9]; sieves are employed in
all known algorithms which adapt to point contention using read and write
operations [1, 2, 4, 5]. A process tries to win in a sequence of sieves, one after
the other, until successful in some sieve; when successful, the process suggests
to reserve this sieve for some (possibly other) process. The sieve is reserved for
one of the processes suggested for it. A process fails in a sieve only if some other
process is inside this sieve; this is used to show that a process accesses sieve s
only if O(s) processes simultaneously participate.

Attiya et al. [6] introduce the one-shot renaming problem and present a
(2k - 1)-renaming algorithm for the message passing model; Bar-Noy and Dolev [10]
translate this algorithm to the shared-memory model. The complexity of these
algorithms is exponential [16]. Gafni [17] describes a one-shot (2k - 1)-renaming
algorithm with 0{n^) step complexity. Borowsky and Gafni [12] present a
one-shot (2k - 1)-renaming algorithm with O(iV^n) step complexity.

Several adaptive renaming algorithms were suggested recently. Attiya and
Fouren [7] present a (6k - 1)-renaming algorithm with O(k log k) step complexity.
Afek and Merritt [3] extend this algorithm to a (2k - 1)-renaming algorithm
with step complexity. The (2k - 1)-renaming algorithms of Gafni [17]
and Borowsky and Gafni [12] can be made adaptive using an adaptive collect
operation [8]. These algorithms are one-shot and adapt to the total contention:
Their step complexity depends on the total number of operations performed so
far, and does not decrease when the number of participating processes drops.

Burns and Peterson [15] present a long-lived (2k - 1)-renaming algorithm
whose step complexity is exponential [16].

Anderson and Moir [21] considered a system where many processes (N) may
participate in an algorithm but in reality no more than n ≪ N processes are
active. That paper and subsequent work [14, 20, 22] presented long-lived
renaming algorithms where n is known in advance, culminating in a long-lived
(2k - 1)-renaming algorithm [20]. This algorithm employs Burns and Peterson’s
algorithm [15] and thus its step complexity is exponential in n.

Long-lived renaming algorithms that adapt to point contention were
presented recently [1, 2, 9]. Some of these algorithms have O(k^ log k) step
complexity but they yield a name space of size O(k^); others provide linear name
space (either 2k - 1 or 6k - 1 names), but their step complexity is exponential.

2 Preliminaries 

In the long-lived M-renaming problem, processes p_1, ..., p_n repeatedly acquire
and release distinct names in the range {1, ..., M}. A solution supplies two
procedures: getName, returning a new name, and releaseName; p_i alternates between
invoking getName_i and releaseName_i, starting with getName_i.
Consider α, an execution of a long-lived renaming algorithm; let α' be a finite
prefix of α. Process p_i is active at the end of α' if α' includes an invocation
of getName_i without a return from the matching releaseName_i. A long-lived
renaming algorithm should guarantee uniqueness of new names: Active processes
hold distinct names at the end of α'.

The point contention (abbreviated contention below) at the end of α', denoted
PntCont(α'), is the number of active processes at the end of α'. Consider a finite
interval β of α; we can write α = α_1 β α_2. The contention during β, denoted
PntCont(β), is the maximum contention in prefixes α_1 β' of α_1 β. If PntCont(β) =
k, then k processes are simultaneously active at some point during β.
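
The definition can be made concrete with a short sketch. The event encoding is an assumption made here for illustration: an execution is a list of `("get", pid)` events (invocation of getName) and `("rel", pid)` events (return from the matching releaseName), and an interval is a slice of that list.

```python
def point_contention(events):
    """Return the number of active processes after each prefix of `events`."""
    active = set()
    counts = []
    for kind, pid in events:
        if kind == "get":        # getName invoked: process becomes active
            active.add(pid)
        elif kind == "rel":      # releaseName returned: no longer active
            active.discard(pid)
        else:
            raise ValueError(f"unknown event kind: {kind}")
        counts.append(len(active))
    return counts

def interval_contention(events, start, end):
    """Maximum contention over all prefixes ending inside events[start:end],
    including the prefix that ends just before the interval starts."""
    counts = point_contention(events[:end])
    before = counts[start - 1] if start > 0 else 0
    return max([before] + counts[start:end])

if __name__ == "__main__":
    run = [("get", 1), ("get", 2), ("rel", 1), ("get", 3), ("rel", 2), ("rel", 3)]
    print(point_contention(run))
    print(interval_contention(run, 2, 5))
```

Note that the contention of an interval depends only on how many processes overlap at a single point, not on how many operations were performed in total.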

A renaming algorithm has an adaptive name space if there is a function
M, such that the name obtained in an interval of getName, β, is in the range
{1, ..., M(PntCont(β))}.

A renaming algorithm has adaptive step complexity if there is a bounded
function S, such that the number of steps performed by p_i in any interval of
getName_i, β, and in the matching releaseName_i is at most S(PntCont(β)). The
contention during an interval is clearly bounded by n. Therefore, getName_i and
releaseName_i terminate within a bounded number of steps of p_i, regardless of
the behavior of other processes; hence, the algorithm is wait-free.

3 The Basic Sieve 

This section describes the basic sieve [9], re-organized so it can be extended (in 
Section 4.2 below) to support reservations. 

A sieve allows processes to obtain a view of the processes accessing the sieve
concurrently, or no view (an empty view). There is a unique non-empty set of
candidates which are seen by all processes getting a view. The sieve guarantees
agreement on the information announced by candidates; this synchronizes the
processes accessing the sieve and allows them to exchange information in an
adaptive manner.

A sieve has an infinite number of copies; at each point in the execution, 
processes are only “inside” a single copy of the sieve. The number of the current 
copy is monotonically incremented by one. 

The sieve supports the following operations:

read(s.count): get the number of the current copy of sieve s.
openFor(s, c): returns all if all operations can enter copy (s, c), and ∅ if no
process can enter copy (s, c).
enter(s, c, info): enter copy (s, c), announcing info; returns the set of candidates,
together with the information announced by them.
exit(s, c): leave the sieve and activate the next copy, (s, c + 1), if possible.

To access sieve s, a process first reads the number of the current copy from
s.count. The process enters the sieve using enter only if openFor returns all, and
leaves the sieve using exit. A process accesses sieve s in the following order:
1. c = read(s.count)                  // find the current copy
2. if (openFor(s, c) == all) then     // copy (s, c) is open
3.     enter(s, c, info)              // enter (s, c) and announce info
4.     exit(s, c)                     // leave the sieve

3.1 Implementation of the Basic Sieve 

Sieves are implemented using ideas from the Borowsky-Gafni simulation [11, 13], 
modified to guarantee adaptive step complexity. 

A sieve has an infinite number of copies. At each point of the execution, can- 
didates are inside a single copy; these processes access the sieve simultaneously. 

A process tries to get inside a sieve by checking if it is among the first 
processes to access the current copy of this sieve. It succeeds only if this copy 
is free (no other process is already inside it), and no candidate is inside the 
previous copy. If a process does not get inside the sieve, then some concurrent 
process is already inside the sieve. In this manner, the sieve “catches” at least 
one process: one of the processes which access the current copy enters the sieve. 
This property makes the sieve a useful tool in adapting to contention. 

For each sieve there is an integer variable, count, indicating the current copy
of the sieve; it is initially 1. There is an infinite number of copies for each sieve,
numbered 1, 2, .... Each copy has the following data structures.

- An array R[1, ..., N] of views; all views are initially empty. Entry R[id_i]
contains the view obtained by process p_i in this copy.

- An array done[1, ..., N] of Boolean variables; all entries are initially false.
Entry done[id_i] indicates whether process p_i is done with this copy.

- A Boolean variable allDone, initially false. Indicates whether all processes
which could be inside this copy are done.

- A Boolean variable inside, initially false. Indicates whether some process is
already inside this copy.

With each copy, we associate a separate adaptive procedure for one-shot scan of
the participating processes, latticeAgreement [7], with O(k log k) step complexity.

The pseudocode appears in Algorithm 1. For simplicity, we associate a virtual
Boolean variable, allDone, with copy 0 of every sieve, whose value is true.

3.2 Properties of the Basic Sieve 

As mentioned before, a process accessing the sieve gets a view of the processes 
in the sieve concurrently with it, or gets an empty view. Among the processes 
which get a view, there is a unique non-empty set of candidates which are seen by 
all processes; candidates are in the same copy of the sieve simultaneously. Some
candidates are winners, who update this copy’s data structures, e.g., increment
count by 1, in a synchronized manner.
Algorithm 1 Long-lived adaptive sieve: code for process p_i.

data types:
  processID: int 1 ... N                                    // process' id
  view: vector of ⟨ID : processID, INFO : information field⟩

view procedure enter(s, c: int, info: information field)    // enter (s,c) with info
1: s.inside[c] = true                   // notify that there is a process inside copy (s,c)
2: V = s.latticeAgreement[c](info)      // announce info (from [7])
3: s.R[c][id_i] = V                     // save the obtained view
4: return candidates(s, c)              // return the set of candidates

void procedure exit(s, c: int)          // leave the sieve
1: s.done[c][id_i] = true               // p_i is done
2: W = candidates(s, c)                 // get the set of candidates
3: if (id_i ∈ W) then s.count = c + 1   // p_i is a winner in (s,c)
4: if (W ≠ ∅ and ∀ id_j ∈ W, s.done[c][id_j] == true) then   // candidates are done
5:     releaseSieve(s, c, W)            // release the sieve

boolean procedure openFor(s, c)         // check whether copy (s,c) is open
1: if (s.allDone[c - 1] and             // all candidates of the previous copy are done
       not s.inside[c])                 // and no process is inside the current copy
2: then return all                      // the copy is open for all processes
3: else return ∅                        // the copy is open for no process

view procedure candidates(s, c)         // returns the candidates of (s,c)
1: V = s.R[c][id_i]
2: W = min{s.R[c][id_j] | id_j ∈ V and s.R[c][id_j] ≠ ∅}     // min by containment
3: if ∀ id_j ∈ W, s.R[c][id_j] ⊇ W then return W
4: else return ∅

void releaseSieve(s, c, W)              // update data structures and release the sieve
1: s.allDone[c] = true                  // release the sieve
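
To make the data flow of Algorithm 1 easier to follow, here is a minimal single-threaded sketch in Python. It is not the wait-free algorithm: operations run one at a time, and latticeAgreement is replaced by an assumed stand-in that simply returns everything announced so far in the copy (so views are ordered by containment). The class and method names are illustrative.

```python
class Sieve:
    """Sequential toy model of one sieve with numbered copies."""

    def __init__(self):
        self.count = 1               # number of the current copy
        self.inside = {}             # copy -> True once some process entered
        self.all_done = {0: True}    # virtual copy 0 is always done
        self.done = {}               # copy -> {pid: True}
        self.R = {}                  # copy -> {pid: view}, view = {pid: info}
        self.announced = {}          # copy -> {pid: info} (lattice stand-in)

    def open_for(self, c):
        return self.all_done.get(c - 1, False) and not self.inside.get(c, False)

    def enter(self, c, pid, info):
        self.inside[c] = True
        self.announced.setdefault(c, {})[pid] = info
        view = dict(self.announced[c])          # stand-in for latticeAgreement
        self.R.setdefault(c, {})[pid] = view
        return self.candidates(c, pid)

    def candidates(self, c, pid):
        views = self.R.get(c, {})
        own = views.get(pid, {})
        nonempty = [views[j] for j in own if views.get(j)]
        if not nonempty:
            return {}
        w = min(nonempty, key=len)              # minimum by containment
        if all(views.get(j) and set(w) <= set(views[j]) for j in w):
            return w
        return {}

    def exit(self, c, pid):
        self.done.setdefault(c, {})[pid] = True
        w = self.candidates(c, pid)
        if pid in w:                            # pid is a winner
            self.count = c + 1
        if w and all(self.done[c].get(j, False) for j in w):
            self.all_done[c] = True             # releaseSieve
```

For example, if process 7 enters copy 1 and process 9 enters it before 7 exits (modelling two processes that both passed the openFor check), both obtain the same candidate set {7}; only 7 is a winner, and copy 2 opens once 7 has exited.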



For copy c of sieve s, procedure candidates(s, c) returns either an empty view
(in Line 4) or a non-empty view (in Line 3). The key agreement property of the
sieve is that all non-empty views returned by candidates(s, c) are equal:

Lemma 1. If W_1 and W_2 are non-empty views returned by invocations of
candidates(s, c) then W_1 = W_2.

Process p_i is a candidate in copy c of sieve s if it appears in the non-empty
view returned by candidates(s, c) (by Lemma 1, this view is unique). Process p_i
is a winner in copy c of sieve s if the view obtained by p_i from candidates(s, c)
contains p_i itself (see Line 2 of exit(s, c)), and in particular, is not empty. Note
that a winner is, in particular, a candidate.
Lemma 2. If p_i is a candidate in copy c of sieve s, then p_i appears in every
non-empty view returned by an invocation of candidates(s, c).

Process p_i is inside copy c of sieve s when it sets s.inside[c] to true in
Line 1 of enter. A process inside copy c of sieve s is done after it assigns true to
s.done[c][id_i] in Line 1 of exit(s, c). The next lemma states that candidates are
only inside a single copy of a specific sieve.

Lemma 3. If process p_i is inside copy c of sieve s then all candidates of smaller
copies, 1, ..., c - 1, of sieve s are done.

By Lemmas 2 and 3 and the code, processes write the same values to s.inside[c]
and s.allDone[c]. Thus, each of these variables changes only once during the
execution. Also, s.count increases exactly by 1. This implies the next proposition:

Proposition 4 (Synchronization). The variables of sieve s change in the
following order (starting with s.count = c - 1):

1. inside[c - 1] = true
2. count = c
3. allDone[c - 1] = true
4. inside[c] = true
5. count = c + 1
6. allDone[c] = true
7. inside[c + 1] = true
8. count = c + 2, etc.

It can be shown that processes enter a specific copy of a sieve simultaneously,
and cannot do so at different points. The next lemma states that at least one
of the processes which are inside a copy of a sieve is a winner.

Lemma 5. At least one of the processes which assign true to s.inside[c] is a
winner of copy c of sieve s.

By Lemma 2, a non-empty view returned by candidates includes all
candidates. It is possible that p_i obtains an empty view from candidates and is not
a winner, yet later processes will see p_i in a non-empty view obtained from
candidates.

Process p_i accesses copy (s, c) if it reads c from s.count; p_i enters (s, c) if it
performs enter(s, c); otherwise, p_i skips (s, c).

The next lemma shows that a process does not win a sieve only if some
concurrent process is a candidate in this sieve.

Lemma 6. If p_i accesses sieve s and does not win, then there is a candidate of
sieve s which is inside sieve s concurrently with p_i.

The step complexity of enter is dominated by the O(k log k) step complexity
of latticeAgreement [7]; the step complexity of exit is O(k).



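Annotations to the ordering in Proposition 4: copy (s, c) is open after step 3
and is not open after step 4; copy (s, c + 1) becomes open after step 6 and is
not open after step 7.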
4 Polynomial Adaptive Long-lived (2k - 1)-Renaming

Our algorithm employs a helping mechanism from the renaming algorithm of
Gafni [17]. In Gafni’s algorithm, a process tries to reserve names for processes
with smaller priority and then obtains a name reserved for it. When there are
several reservations for the same process, a reservation for a small name overrules
reservations for larger names.

Gafni’s algorithm is not long-lived as it has no mechanism for releasing a
name or cancelling a reservation. His algorithm represents names with arrays;
this is not adaptive since O(N) operations are required to read the arrays.

Our algorithm extends the basic sieve (described in Section 3) to support
reservations: If sieve s is reserved for process p_i, then only p_i can enter the sieve
and get a new name s. Sieves can be entered repeatedly, in an adaptive manner,
which makes them appropriate for long-lived adaptive algorithms.

As in the basic sieve, the modified sieve has a set of candidates; the sieve is
reserved for the minimal process suggested by the candidates. In Gafni’s
algorithm, no reservation is made when several processes are suggested for the same
name; in our algorithm, progress is made even if there is a collision since there
is agreement on the reserved process.

In our algorithm, processes’ priorities are based on timestamps. In a getName
operation, process p_i acquires a timestamp TS_i, and the operation is identified
by ⟨TS_i, id_i⟩. Timestamps are comparable, and earlier operations have smaller
timestamps. In contrast, Gafni’s algorithm uses static priorities for processes,
based on their identifiers.



4.1 The Renaming Algorithm 

In Algorithm 2, a process wishing to obtain a new name first gets a timestamp. 
Then, it repeatedly tries to reserve names for processes with smaller timestamps 
until it sees a reservation for itself. Finally, it gets a name from one of the sieves 
reserved for it. 

To obtain the optimal name space, reservations are made only for processes
which are active when getName starts. The set of active processes is maintained
using the following adaptive procedures of Afek et al. [4]: In join(id_i), a process
announces it is active, while in leave(id_i), a process announces it is no longer
active; getSet returns the current set of active processes. It is guaranteed that
processes which complete join before the beginning of getSet and start leave after
the end of getSet appear in the set returned by getSet. A process completing join
after the beginning of getSet or starting leave before the end of getSet may or
may not appear in this set. The step complexity of these procedures is [4].

Timestamps are obtained (in an adaptive manner) in procedure getTS, from
a separate chain of sieves, using an idea we presented in [1]. A timestamp is a
sequence of integers read from the counters of these sieves. Timestamps are
compared by lexicographic order; if two timestamps have different lengths, then the
shorter one is extended with the necessary number of +∞. It can be shown that
non-overlapping getTS operations return monotonically increasing timestamps.
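
The comparison rule for timestamps can be written down in a few lines. This is a sketch of the ordering only (it does not model getTS itself); representing a timestamp as a Python list of integers and the function name `ts_less` are assumptions made for illustration.

```python
import math
from itertools import zip_longest

def ts_less(a, b):
    """True if timestamp `a` is strictly smaller than timestamp `b`.

    Timestamps are compared lexicographically; the shorter one is padded
    with +infinity so that both have the same length."""
    for x, y in zip_longest(a, b, fillvalue=math.inf):
        if x != y:
            return x < y
    return False  # equal timestamps

if __name__ == "__main__":
    print(ts_less([1, 3], [2]))        # decided by the first component
    print(ts_less([1, 2, 5], [1, 2]))  # shorter one is padded with +inf
```

Padding with +∞ means that, among timestamps sharing a common prefix, the longer one is the smaller.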
In order to reserve a name, process p_i finds the smallest “empty” sieve which
is not reserved, or whose reservation is overruled by a reservation in a smaller
sieve. p_i enters the current copy of this sieve, announcing which process it is
suggesting. Then, p_i computes the candidates of this copy; if all candidates of
this copy are done, then p_i reserves the next copy of the sieve for the minimal
process suggested by them, and exits the sieve immediately. Since processes agree
on the suggestions of candidates, it is clear for which process to reserve the next copy.

To get a name for itself, p_i tries to enter sieves reserved for it. If p_i is the
single winner in sieve s, then its new name is s; otherwise, p_i moves to the
preceding (smaller) sieve; in this case, a smaller sieve must be reserved for p_i.
An unbounded array, res[⟨TS_i, p_i⟩], is used to notify process p_i that a
reservation was made for its operation with timestamp TS_i. Section 6 discusses how
to bound this array as well as the number of copies for each sieve.



4.2 The Modified Sieve 

The basic sieve is modified so that if sieve s is reserved for an operation oid_i, then
only process p_i performing oid_i can enter sieve s. The modified sieve supports
the same operations as the basic sieve, with openFor extended as follows:

openFor(s, c): returns all if all operations can enter copy (s, c), ∅ if no process
can enter copy (s, c), and oid_i if only oid_i can enter copy (s, c).

Algorithm 3 presents the modified procedures, openFor and releaseSieve.
Process p_i executing oid_i accesses the modified sieve in the same order as for
the basic sieve, adapted to accommodate the change in openFor:

1. c = read(s.count)                       // find the current copy
2. if (openFor(s, c) == all or oid_i)      // copy (s, c) is open or reserved for oid_i
3. then enter(s, c, info)                  // enter (s, c) and announce info
4.      exit(s, c)                         // leave the sieve

5 Proof of Correctness 

5.1 Uniqueness 

The uniqueness of new names follows from properties of the basic sieve, proved
in Section 3.2. In getNameForSelf, a process gets a name s only if it is a single
winner in sieve s (that is, candidates returns {oid_i}). Lemmas 1 and 3 imply:

Proposition 7. No two processes hold the same name after a finite prefix.
Algorithm 2 Long-lived (2k - 1)-renaming: code for process p_i.

data types:
  timeStamp: string of integers
  operationID: ⟨TS : timeStamp, ID : processID⟩

int procedure getName()                          // get a new name from the range 1 ... 2k - 1
1: TS_i = getTS()                                // get a timestamp
2: oid_i = ⟨TS_i, id_i⟩                          // id of the current operation
2: currOID[id_i] = oid_i                         // announce the id of the current operation
3: join(oid_i)                                   // join the active set [4]
4: O = getSet()                                  // get operations of the active processes [4]
5: repeat
5:     oid_r = min{oid ∈ O | res[oid] == ⊥}      // try to reserve for
                                                 // the earliest operation without a reservation
6:     reserve(oid_r)
7: until ((start-sieve = res[oid_i]) ≠ ⊥)        // have a reservation
8: return getNameForSelf(oid_i, start-sieve)     // get a name in a reserved sieve

void procedure reserve(oid_r)                    // try to reserve a sieve
1: s = 0
2: repeat forever                                // loop over sieves
3:     s++
4:     c = s.count
5:     if (openFor(s, c) == all)                 // copy is empty and without a reservation
7:         enter(s, c, oid_r)                    // try to reserve for oid_r
8:         exit(s, c)                            // leave the sieve
9:         return

int procedure getNameForSelf(oid_i, start-sieve) // get a name in a reserved sieve
1: for (s = start-sieve down to 1)
2:     c = s.count
3:     if (openFor(s, c) == oid_i) then          // (s,c) is empty and reserved for oid_i
6:         W = enter(s, c, ⊥)                    // no suggestion for the next copy
7:         if (W == {oid_i}) then return s       // p_i is the single winner in sieve s
8:         else exit(s, c)                       // a smaller sieve is reserved for oid_i

void procedure releaseName()
1: exit(s, c)       // leave the sieve; (s,c) is remembered from the last call to enter
2: leave(oid_i)     // leave the active set [4]
Algorithm 3 The modified sieve: new code for openFor and releaseSieve.

data type:
  accessList: operationID ∪ {∅, all}             // operations which can access a sieve

accessList procedure openFor(s, c)               // which operations can access (s,c)
1: if (not s.allDone[c - 1] or s.inside[c]) then return ∅     // as in Algorithm 1
2: oid_r = s.res[c]
3: if (oid_r == ⊥                                // no reservation
4:     or for some s' ∈ {s - 1, ..., 1}, c' = s'.count and     // reservation for oid_r is
          s'.allDone[c' - 1] and s'.res[c'] == oid_r          // overruled in a smaller sieve
5:     or currOID[oid_r.ID] ≠ oid_r)             // p_r started a new operation
6: then return all                               // the sieve is open for any operation
7: else return oid_r                             // the sieve is reserved for oid_r

void releaseSieve(s, c, W)           // reserve (s,c) for the earliest operation suggested
1: oid_r = min{oid | ⟨id_j, oid⟩ ∈ W}            // operation suggested by candidates
2: s.res[c + 1] = oid_r                          // reserve the next copy for oid_r
3: s.allDone[c] = true                           // as in Algorithm 1
4: res[oid_r] = s                                // notify p_r that it has a reservation in sieve s
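
The three ways in which a reservation stops being effective can be illustrated with a sequential Python sketch of the extended openFor. It is a simplified reading under stated assumptions: sieves are held in a dictionary keyed by sieve number, operation ids are `(timestamp, process id)` tuples, and `SieveState`, `ALL`, and `EMPTY` are names invented here; concurrency is not modelled.

```python
from dataclasses import dataclass, field

ALL = "all"    # sentinel: open for every operation
EMPTY = None   # sentinel: open for no operation (the empty set)

@dataclass
class SieveState:
    count: int = 1
    inside: dict = field(default_factory=dict)                  # copy -> bool
    all_done: dict = field(default_factory=lambda: {0: True})   # virtual copy 0
    res: dict = field(default_factory=dict)                     # copy -> operation id

def open_for(sieves, s, c, curr_oid):
    """Which operations may enter copy c of sieve s.

    sieves   -- dict: sieve number -> SieveState
    curr_oid -- dict: process id -> id of that process's current operation"""
    sv = sieves[s]
    if not sv.all_done.get(c - 1, False) or sv.inside.get(c, False):
        return EMPTY                      # previous copy busy, or copy occupied
    oid = sv.res.get(c)
    if oid is None:
        return ALL                        # no reservation
    for t in range(s - 1, 0, -1):         # look for an overruling smaller sieve
        st = sieves[t]
        ct = st.count
        if st.all_done.get(ct - 1, False) and st.res.get(ct) == oid:
            return ALL
    if curr_oid.get(oid[1]) != oid:
        return ALL                        # the owner started a new operation
    return oid                            # reserved for oid only
```

In this model a reservation in sieve 2 is honoured until the same operation also holds a reservation in sieve 1, at which point sieve 2 becomes open to everyone again.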



5.2 Properties of the Reservations 

Sieve s is reserved for operation oid_r after a finite prefix γ, denoted Res(s, γ) =
oid_r, if and only if s.allDone[s.count - 1] = true (candidates of the previous
copy are done), s.inside[s.count] = false (no process is in the current copy) and
s.res[s.count] = oid_r (the current copy is reserved for oid_r).

By Lemmas 2 and 3 and the code, winners of copy (s, c) reserve the next
copy for the same operation. This implies the following extension of Proposition 4:

Proposition 8 (Synchronization). The variables of a sieve s change in the
following order (starting with s.count = c - 1):

1. inside[c - 1] = true
2. count = c
3. res[c] = oid_r
4. allDone[c - 1] = true
5. inside[c] = true
6. count = c + 1
7. res[c + 1] = oid_r'
8. allDone[c] = true
9. inside[c + 1] = true
10. count = c + 2, etc.

The next lemma shows that if openFor(s, c) returns oid_r, then sieve s is indeed
open for oid_r. The first three statements mean that Res(s, γ) = oid_r.



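Annotations to the ordering in Proposition 8: copy (s, c) becomes reserved for
oid_r after step 4 and is no longer reserved after step 5; copy (s, c + 1) becomes
reserved for oid_r' after step 8 and is no longer reserved after step 9.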
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Lemma 9. If open For (s^c) returns oidr^ then there is a prefix j whieh ends on 
the exeeution interval of Line 1 of openFor(s^c)^ sueh that s. allDone[s. count — 
1 ] = true^ s.inside[s. count] — false^ s.res[s. count] — oidr^ and s. count — c at 
the end ofj, 

MinRes{oidi,j) denotes the minimal sieve reserved for an operation oidj (of 
process Pi) after an execution prefix 7. Since a reservation for oidj in a smaller 
sieve overrules reservations for this operation in larger sieves, MinRes{oidi,j) is 
the actual reservation for oidj. For convenience, it is 00 if no sieve is reserved 
for oidj. It is 0 after pi enters the copy of the sieve from which it gets its new 
name; i.e., Pi writes true into 8.inside[8.eount], where 8. count is the copy of sieve 
8 from which pi gets its new name. 

We can prove that if openFor(s,c) returns all, then either there is no reser- 
vation in sieve 8 or the reservation is overruled by a smaller sieve. This allows 
to show that a process does not destroy the minimal reservation for another 
process. This is the key to proving that MinRes monotonically decreases: 

Lemma 10. Assume $\gamma$ is a finite prefix. Then for every prefix $\gamma'$ of $\gamma$ and every
operation $oid_i$, $MinRes(oid_i, \gamma') \ge MinRes(oid_i, \gamma)$.

The next lemma shows that openFor recognizes the minimal reservation.

Lemma 11. If openFor(s, c) returns $oid_r$, then there is a prefix $\gamma$ which ends on
the execution interval of openFor(s, c), such that $MinRes(oid_r, \gamma) = s$.

Proof. By Lemma 9 there is a prefix $\gamma$ which ends on the execution interval of
Line 1 of openFor such that $Res(s, \gamma) = oid_r$. Therefore, $MinRes(oid_r, \gamma) \le s$.

Assume, by way of contradiction, that MinRes{oidrjj) — 8^ < 8. Let 7^ 
be the shortest prefix such that MinRes{oidrjj^) — P. Suppose that Pi reads 
8 — l.counh . . . , 1 . count (in Line 4 of openFor) at the ends of prefixes 7^ ^1, ... ,71, 
respectively. By the code, • • • ,71 end after the end of 7. 

If 7 ^/ is included in 7 ^, then by Lemma 10, s > MinRes{oidrjjB^i) > • • • > 
MinResioidrjjsI) > P. By the pigeon hole principle, MinRes{oidrjji) — t foi' 
some £ € {s — 1, . . . ,s^}. By the sieve’s properties, pi reads l.aIlDone\c — 1 ] = 
true, and l^res[c^ = ouR (since these variables do not change after 7 /). Thus, 
the condition in Line 4 of openFor holds for pi and openFor returns all. This is 
a contradiction, and the lemma follows. 

Otherwise, 7 ^/ is not included in 7 b Since P is the minimal sieve reserved 
for oidr., no process writes to P. count from the end of 7^ until Pr writes a new 
value to currOID[oidr]- Since Pi reads 8hcount before Pr writes a new value to 
currOID[oidr]j the sieve’s properties imply that Pi reads true from 8hallDone[c— 
1 ], and ouR from 8hres[c] (since these variables do not change after 7^/). Thus, 
the condition in Line 4 of openFor holds for pi and openFor returns all. This is 
a contradiction, and the lemma follows. □ 
5.3 Size of the Name Space

Let $\delta_i$ be an interval of getName$_i$; $p_i$ has several attempts (calls to reserve) to
make a reservation; let $\beta_i$ be the interval of one invocation of reserve. Partition
$\beta_i$ into disjoint intervals during which $p_i$ accesses different sieves: if $p_i$ accesses
sieve $s$, then $\beta_i|_s$ is the interval which starts when $p_i$ reads from $s.count$ (or the
beginning of $\beta_i$, if $s = 1$) and ends when $p_i$ reads from $(s+1).count$ (or the end
of $\beta_i$, if $p_i$ is a winner in sieve $s$).

The next lemma shows that when a process skips a sieve in reserve, either
some concurrent process is a candidate in this sieve or the sieve is reserved for
another process. It follows from Lemmas 6 and 11 and the code.

Lemma 12. If $p_i$ skips sieve $(s, c)$ in reserve, then either

(1) there is a candidate in $(s, c)$ concurrently with $p_i$, or

(2) there is an operation $oid_r$ such that $MinRes(oid_r, \gamma) = s$ for some prefix $\gamma$
ending in $\beta_i|_s$.

A potential method is used to show that only sieves $< 2k$ are reserved; this
bounds the name space, since a process obtains its new name only from a sieve
reserved for it. We define two sets of simultaneously active operations and show
that at least one of them grows by 1 when $p_i$ skips a sieve in reserve. Each set
contains at most $k - 1$ concurrent operations (except $p_i$); this implies $p_i$ skips
at most $2k - 1$ sieves, and makes a reservation in a sieve $\le 2k$. However, as we
show, the minimal reservation for a process is always $< 2k$.

The first set, $E_s$, contains operations whose attempt ends in a sieve $\le s$,
while the second set, $R_s$, contains operations reserved in a sieve $\le s$. Formally,
for a prefix $\gamma$ and a process $p_i$ which performs an attempt with interval $\beta_i$, define:

$E_s(\gamma) = \{oid_q \mid q$ is a candidate in a sieve $\le s$ whose attempt covers the end of $\gamma\}$

$R_s(\gamma, p_i) = \{oid_q \mid q$ is active at the end of $\gamma$ and $MinRes(oid_q, \beta_i|_s) \le s\}$

$E$ contains operations whose attempt covers the end of $\gamma$, while $R$ contains
operations whose whole execution interval covers the end of $\gamma$. This reflects the fact
that $E$ depends on an operation's execution, while $R$ depends on the minimal
sieve reserved for an operation by others. In the definition of $R_s$, $\gamma$ is used to
determine which operations are active, but MinRes is evaluated at the end of $\beta_i|_s$.

The next key lemma shows that a process accesses a sieve with a high index
only if many processes are active concurrently with it. The proof is by induction
on the sieve's number, and considers the different cases in which a process skips
a sieve; the proof is omitted due to lack of space.

Lemma 13. Let $\beta_i$ be the interval of an attempt by $p_i$. If $p_i$ skips sieve $s$, then
there is a prefix $\gamma$ which ends in $\beta_i|_1 \cdots \beta_i|_s$ such that
$|E_s(\gamma)| + |R_s(\gamma, p_i)| \ge s$.

The next lemma shows that the minimal sieve reserved for a process is $< 2k$.

Lemma 14. Let $\delta_i$ be the execution interval of $oid_i$ of $p_i$. If for some prefix $\gamma_i$,
$MinRes(oid_i, \gamma_i)$ is not $\infty$, then $MinRes(oid_i, \gamma_i) < 2k$, where $k = PntCont(\delta_i)$.
Proof. Let $MinRes(oid_i, \gamma_i) = s$, and let $p_j$ be a process which suggests $oid_i$ in
sieve $s$. Let $\beta_j$ be the interval of $p_j$'s attempt in which it skips sieves $1, \ldots, s - 1$
and suggests $oid_i$ in $(s, c - 1)$. By Lemma 13, $s - 1 \le |R_{s-1}(\gamma_j, p_j)| + |E_{s-1}(\gamma_j)|$
for some prefix $\gamma_j$ which ends in $\beta_j|_1 \cdots \beta_j|_{s-1}$. Let $k = PntCont(\beta_j|_{s-1})$;
$k \le PntCont(\delta_i)$, since $\beta_j|_{s-1}$ is contained in $\delta_i$. By definition,
$R_{s-1}(\gamma_j, p_j)$ and $E_{s-1}(\gamma_j)$ contain operations which are simultaneously active
at the end of $\gamma_j$; hence, $|R_{s-1}(\gamma_j, p_j)| \le k - 1$. By definition,
$oid_j \notin E_{s-1}(\gamma_j)$, which implies that $|E_{s-1}(\gamma_j)| \le k - 1$.

If $oid_i \in R_{s-1}(\gamma_j, p_j)$, then by the definition of $R_{s-1}$,
$MinRes(oid_i, \beta_j|_{s-1}) \le s - 1$. The properties of the sieve imply that $p_j$ accesses
$(s, c - 1)$ before sieve $s$ is reserved for $oid_i$. That is, $\beta_j|_{s-1}$ ends before $\gamma_i$.
Therefore, by Lemma 10, $MinRes(oid_i, \gamma_i) \le MinRes(oid_i, \beta_j|_{s-1}) \le s - 1$.
The lemma follows since $s - 1 \le |R_{s-1}(\gamma_j, p_j)| + |E_{s-1}(\gamma_j)| < 2k - 1$.

If $oid_i \notin R_{s-1}(\gamma_j, p_j)$, then $|R_{s-1}(\gamma_j, p_j)| < k - 1$. In this case, the lemma
follows since $s \le |R_{s-1}(\gamma_j, p_j)| + |E_{s-1}(\gamma_j)| + 1 < 2k - 1$. □

This proves that the minimal reservation for an operation is in a sieve $< 2k$.
We now show that $p_i$ succeeds in getting a name from one of the sieves reserved
for it. The next lemma states that if $p_i$ skips sieve $s$ in getNameForSelf, then a
smaller sieve is reserved for $p_i$.

Lemma 15. Assume that in getNameForSelf $p_i$ reads $s.count$ at the end of prefix
$\gamma_s$ and $(s - 1).count$ at the end of prefix $\gamma_{s-1}$; that is, $p_i$ skips sieve $s$. If
$MinRes(oid_i, \gamma_s) \le s$ then $MinRes(oid_i, \gamma_{s-1}) \le s - 1$.

Finally, we prove that the algorithm provides optimal name space.

Lemma 16. getNameForSelf returns a new name in the range $\{1, \ldots, 2k - 1\}$.

Proof. By the algorithm, $p_i$ calls getNameForSelf (Line 8 of getName) after it
reads a non-$\bot$ value, say, $ss$, from $res[oid_i]$. Sieve $ss$ is reserved for $oid_i$ before $ss$
is written to $res[oid_i]$.

Let $s_m$ be the minimal value of $MinRes(oid_i)$. In getNameForSelf, $p_i$ accesses
sieves $ss, ss - 1, \ldots$. If $p_i$ does not win some sieve $s$, then Lemma 15 implies
that a smaller sieve is reserved for $p_i$. Thus, $p_i$ wins sieve $s_m$ at the latest and
getNameForSelf returns a name $s > 0$.

It remains to show that the new name is $< 2k$. By the algorithm, openFor(s, c)
returns $oid_i$ (Line 3). Lemma 11 implies that $MinRes(oid_i, \gamma) = s$, for some prefix
$\gamma$. By Lemma 14, $MinRes(oid_i, \gamma) < 2k$ and hence, $s < 2k$. □

5.4 Step Complexity 

Let $elected(s, c)$ be the minimal operation identifier suggested by candidates in
copy $(s, c)$. By Lemma 1, processes agree on the set of candidates and the next
copy of the sieve, $(s, c + 1)$, is reserved for $elected(s, c)$. If process $p_i$ gets its new
name from copy $(s, c)$, then $p_i$ is the single candidate in $(s, c)$ and it writes $\bot$ in
Line 5 of exit; in this case, $elected(s, c) = \bot$. Since a process makes suggestions
in one sieve at a time, we can bound the number of different copies in which the
same operation is elected.
Lemma 17. An operation is elected in at most $2k$ copies.

Line 6 of getName guarantees that $p_i$ repeatedly calls reserve as long as no
reservation was made for it. To bound the number of attempts $p_i$ makes, we
bound the number of different operations encountered by $p_i$ as $elected(s, c)$ and
the number of copies they are elected in. Using similar arguments, we also prove
that $O(k)$ attempts of $p_i$ end in a copy $(s, c)$ such that $elected(s, c) = \bot$.

Lemma 18. $O(k^2)$ attempts of $p_i$ end in a copy $(s, c)$ such that
$elected(s, c) = oid_j \ne \bot$. $O(k)$ attempts of $p_i$ end in a copy $(s, c)$ such that
$elected(s, c) = \bot$.

Thus, $p_i$ discovers that a sieve is reserved for $oid_i$ and calls getNameForSelf
after $O(k^2)$ attempts. In each attempt, $p_i$ skips $O(k)$ sieves and enters one sieve;
in getNameForSelf, $p_i$ enters $O(k)$ sieves. Skipping a sieve requires $O(k)$ steps
(Line 9 of reserve); entering a sieve requires $O(k \log k)$ steps [7]. Therefore, the
total step complexity of the algorithm is
$O(k^2 [k \cdot k + k \log k] + k \cdot k \log k) = O(k^4)$.
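
The final bound can be checked by expanding the sum term by term. The following is a sketch that assumes the number of attempts is $O(k^2)$ (the product of $O(k)$ elected operations and the $2k$ copies of Lemma 17); the per-step costs are those stated above.

```latex
\begin{align*}
\underbrace{O(k^2)}_{\text{attempts}}
  \cdot \bigl[\,\underbrace{O(k) \cdot O(k)}_{\text{skipping sieves}}
  + \underbrace{O(k \log k)}_{\text{entering one sieve}}\,\bigr]
  + \underbrace{O(k) \cdot O(k \log k)}_{\text{getNameForSelf}}
  &= O(k^4) + O(k^3 \log k) + O(k^2 \log k) \\
  &= O(k^4).
\end{align*}
```

The skipping term dominates: each of the $O(k^2)$ attempts may skip $O(k)$ sieves at $O(k)$ steps apiece.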

6 Discussion 

We present a long-lived $(2k - 1)$-renaming algorithm with $O(k^4)$ step complexity,
where $k$ is the maximal number of simultaneously active processes at some point
of the execution.

The algorithm described here uses an unbounded amount of shared memory.
First, each sieve has an unbounded number of copies. Copies can be recycled
(and hence bounded); see [1]. Second, an unbounded array is used to notify a
process about reservations for each of its operations. This array can be replaced
by a two-dimensional array in which each process announces, for each other
process, the last timestamp it reserved for (and in which sieve). A process checks,
for every process in the current active set, whether a reservation was made for its
current timestamp; then it tries to enter from the minimal reservation made for
it, if there is any. The details are postponed to the full version of the paper.
Computing with Infinitely Many Processes
under assumptions on concurrency and participation
(Extended abstract)

Michael Merritt, Gadi Taubenfeld



Abstract. We explore four classic problems in concurrent computing
(election, mutual exclusion, consensus, and naming) when the number of
processes which may participate is infinite. Partial information about the
number of actually participating processes and the concurrency level is
shown to affect the possibility and complexity of solving these problems.
We survey and generalize work carried out in models with finite bounds
on the number of processes, and prove several new results. These include
improved bounds for election when participation is required and a new
adaptive algorithm for starvation-free mutual exclusion in a model with
unbounded concurrency. We also explore models where objects stronger
than atomic registers, such as test&set bits, semaphores or read-modify-write
registers, are used.



1 Introduction 

1.1 Motivation 

We explore several classic problems in concurrent computing (election, mutual 
exclusion, and consensus) when the number of processes which may participate is 
infinite. Partial information about the number of actually participating processes 
and the concurrency level is shown to affect the possibility and complexity of 
solving these problems. This paper surveys and generalizes work carried out 
in models with finite bounds on the number of processes, and proves several 
new results. These include improved bounds for election when participation is 
required and a new adaptive algorithm for starvation-free mutual exclusion when 
the number of concurrent processes is not bounded a priori. 

Processes: In most work on the design of shared memory algorithms, it is
assumed that the number of processes is finite and a priori known. Here we
investigate the design of algorithms assuming no a priori bound on the number of
processes. In particular, we assume the number of processes may be infinite.
Although, in practice, the number of processes is always finite, algorithms designed
for an infinite number of processes (when possible) may scale well: their time
complexity may depend on the actual contention and not on the total number of
processes. Starvation-free mutual exclusion has been solved in such a model
using two registers and two weak semaphores (see Theorem 15) [FP87]. Wait-free
solvability of tasks when there is no upper bound on the number of participating
processes has also been investigated [GK98], but in this earlier work no run
has an infinite number of participating processes. A model in which processes
are continuously being created and destroyed (and hence their number is not a
priori known) has also been considered [MT93].

Concurrency: An important factor in designing algorithms where the number of
processes is unknown is the concurrency level: the maximum number of processes
that may be active simultaneously. (That is, participating at the same instant
of time.¹) We distinguish between the following concurrency levels:

— finite: there is a finite bound (denoted by c) on the maximum number of 
processes that are simultaneously active, over all runs. (The algorithms in 
this paper assume that c is known.) 

— bounded: in each run, there is a finite bound on the maximum number of 
processes that are simultaneously active. 

— unbounded: in each run, the number of processes that are simultaneously 
active is finite but can grow without bound. 

— infinite: the maximum number of processes that are simultaneously active 
may be infinite. (A study of infinite concurrency is mainly of theoretical 
interest.) 

The case of infinite concurrency raises some issues as to the appropriate model 
of computation and semantics of shared objects. We assume that an execution 
consists of a (possibly infinite) sequence of group steps, where each group step is
itself a (possibly infinite) sequence of steps, but containing at most one primitive 
step by each process. Hence, no primitive step has infinitely many preceding steps 
by any single process. The semantics of shared objects need to be extended 
(using limits) to take into account behavior after an infinite group step. This 
leaves some ambiguity when a natural limit does not exist. For example, what 
value will a read return that follows a step in which all processes write different 
values? Natural limits exist for all the executions we consider. In particular, in 
the algorithm in Figure 5, the only write steps assign the value 1 to shared bits. 
The well-ordering of primitive steps in an execution of this model assures that 
there is a first write to any bit, and all subsequent reads will return 1. 

Participation: When assuming a fault-free model with required participation
(where every process has to start running at some point), many problems are
solvable using only constant space. However, a more interesting and practical
situation is when participation is not required, as is usually assumed when solving
resource allocation problems. For example, in the mutual exclusion problem a
process can stay in the remainder region forever and is not required to try to
enter its critical section.

¹ We have chosen to define the concurrency level of a given algorithm as the maximum
number of processes that can be simultaneously active. A weaker possible definition
of concurrency, which is not considered here, is to count the maximum number of
processes that actually take steps while some process is active.

We use the notation $[\ell, u]$-participation to mean that at least $\ell$ and at most $u$
processes participate. Thus, for a total of $n$ processes (where $n$ might be infinite)
$[1, n]$-participation is the same as saying that participation is not required, while
$[n, n]$-participation is the same as saying that participation is required. Requiring
that all processes must participate does not mean that there must be a point
at which they all participate at the same time. That is, the concurrency level
might be smaller than the upper bound on the number of participating processes.
Notice also that if an algorithm is correct assuming $[\ell, u]$-participation, then it
is also correct assuming $[\ell', u']$-participation, where $\ell \le \ell' \le u' \le u$. Thus, any
solution assuming that participation is not required is correct also for the case
when participation is required, and hence it is expected that such solutions (for
the case where participation is not required) may be less efficient and harder to
construct.
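
The containment rule for participation bounds is a simple interval check. The sketch below is illustrative only; the function name `participation_implies` is hypothetical, and `math.inf` stands in for an infinite number of processes.

```python
import math

def participation_implies(l, u, l2, u2):
    """Return True if an algorithm that is correct under [l, u]-participation
    is thereby also correct under [l2, u2]-participation, i.e. if
    l <= l2 <= u2 <= u (the second interval is contained in the first)."""
    return l <= l2 <= u2 <= u

n = math.inf
# "Participation not required" ([1, n]) covers "participation required" ([n, n]).
not_required_covers_required = participation_implies(1, n, n, n)
# The converse direction does not follow from the containment rule.
required_covers_not_required = participation_implies(n, n, 1, n)
```

This matches the remark that a solution for the not-required case is automatically a solution for the required case, but not the other way around.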

1.2 Properties 

We define two properties of algorithms considered in the paper.

Adaptive algorithms: An algorithm is adaptive if the time complexity of
processes' operations is bounded by a function of the actual concurrency. The term
contention sensitive was first used to describe such algorithms [MT93], but later
the term adaptive became commonly used. Time complexity is computed using
the standard model, in which each primitive operation on an object is assumed
to take no more than one time unit. In the case of mutual exclusion algorithms,
we measure the maximum time from the release of the critical section until the
critical section is re-entered. For models in which there is a minimum bound on
the number of participating processes, we measure one time unit to the first point
at which the minimum number of processes have begun executing the algorithm.

Symmetric algorithms: An algorithm is symmetric if the only way of
distinguishing processes is by comparing (unique) process identifiers. Process id's can
be written, read, and compared, but there is no way of looking inside any
identifier. Thus, identifiers cannot be used to index shared registers. Various variants
of symmetric algorithms can be defined depending on how much information
can be derived from the comparison of two unequal identifiers. In this paper we
assume that id's can only be compared for equality. (In particular, there is no
total order on process id's in symmetric algorithms.)²

² Styer and Peterson [SP89] discuss two types of symmetry. Our notion corresponds
to their stronger restriction, "symmetry with equality". Some of our asymmetric
algorithms, presented later, work in systems with their weaker symmetry restriction,
"symmetry with arbitrary comparisons," in that they depend upon a total order of
the process id's, but do not (for example) use id's to index arrays.
1.3 Summary of results

We assume the reader is familiar with the definitions of the following problems: 

1. The mutual exclusion problem, which is to design a protocol that guarantees 
mutually exclusive access to a critical section among a number of competing 
processes [Dij65]; 

2. The consensus problem, which is to design a protocol in which all correct 
processes reach a common decision based on their initial opinions [FLP85]; 

3. The (leader) election problem, which is to design a protocol where any fair
run has a finite prefix in which all processes (correct or faulty) commit to
some value in {0, 1}, and exactly one process (the leader) commits to 1. We
do not require the leader to be identified to the other processes, although we
do require all processes to terminate in fair runs. In a model with infinitely
many processes, identifying the leader obviously requires infinite space; this
weaker assumption makes the lower bounds more complex. To prevent trivial
solutions, it is assumed that the ids are not initially known to the processes,
although we do assume that process identities are natural numbers.

4. The wait-free naming problem, which is to assign unique names to initially
identical processes. Every participating process is able to get a unique name
in a finite number of steps regardless of the behavior of other processes.

We show that even with a fixed bound on concurrency and required
participation, election (using registers) requires infinite shared space. Among the novel
algorithms presented are two demonstrating that either a single shared register of
infinite size, or infinitely many shared bits, suffice for both election and
consensus. (In addition, the first algorithm is adaptive.) If in addition test&set bits are
used, then solving the above problem requires only finite space; however, a result
of Peterson ([Pet94]) implies that the more complex problem of starvation-free
mutual exclusion (bounded concurrency, participation not required) still requires
infinite space. In fact, even using read-modify-write registers to solve this
problem, a result of Fischer et al. ([F+89]) implies that infinite space is required.
However, Friedberg and Peterson ([FP87]) have shown that using objects such
as semaphores that allow waiting enables a solution with only constant space.

When there is no upper bound on the concurrency level, we show that
even infinitely many test&set bits do not suffice to solve naming. Naming can be
solved assuming bounded concurrency using test&set bits only, hence it
separates bounded from unbounded concurrency. At the boundary of unbounded and
infinite concurrency, Friedberg and Peterson ([FP87]) have shown a separation
for starvation-free mutual exclusion using registers and weak semaphores: two of
each suffice for unbounded concurrency, but the problem cannot be solved using
these primitives for infinite concurrency.

The tables below summarize the results discussed in this paper. As indicated,
many are derived from existing results, generally proven for models where the
number of processes is finite and known. We use the following abbreviations:
DF for deadlock-free; SF for starvation-free; mutex for mutual exclusion; U for
Upper bound; L for Lower bound; RW for atomic read/write registers; T&S
for test&set bits; wPV for weak semaphores; and RMW for read-modify-write
registers. (The default is "No" for the adaptive and symmetric columns. All
lower bounds hold for the non-adaptive and asymmetric case.)
2 Atomic registers: participation is required

This section demonstrates that requiring a minimum number of processes to
participate (the required participation model) is a powerful enabling assumption:
the problems we study are solvable with a small number of registers. However, 
reducing the number of registers does not mean that it is also possible to achieve 
a finite state space. We begin by showing that any solution to election in this 
model requires infinite state space. Then we show that election can be solved 
either by using an infinite number of registers of finite size or a (small) finite 
number of registers of infinite size. 

2.1 A lower bound for election when the concurrency is c > 1 

Even when participation is required, any solution to election for infinitely many 
processes must use infinite shared space. This holds even if processes do not need 
to learn the identity of the leader as part of the election algorithm. 

Theorem 1. There is no solution to election with finite concurrency $c > 1$ and
with $[c, \infty]$-participation using finite shared memory (finitely many registers of
finite size).

Proof. Assume to the contrary that there is such an algorithm using finite space.
Consider the solo run $solo(p)$ of each process $p$, and define $write(p)$ to be the
subsequence of write events in $solo(p)$. For each process $p$, let $repeat(p)$ be the
first state of the shared memory that repeats infinitely often in $solo(p)$. (Recall
that the algorithm uses finite space by assumption. Also, if $write(p)$ is finite,
$repeat(p)$ will necessarily be the state after the last write in $write(p)$.)

Let $\beta(p)$ be the finite prefix of $write(p)$ that precedes the first instance of
$repeat(p)$. For each such $\beta(p)$, we construct a signature subsequence, $sign(p)$,
by removing repetitive states and the intervening steps, as follows: Ignoring the
states of the processes for now, let $\beta(p) = x_0, x_1, x_2, \ldots$, where the $x_j$ are the
successive states of the shared memory in $\beta(p)$. Suppose $x_j$ is the first state that
repeats in $\beta(p)$, and that $x_k$ is the last state in $\beta(p)$ such that $x_j = x_k$. Remove
the subsequence $x_{j+1}, \ldots, x_k$ from $\beta(p)$. The resulting sequence
$x_0, \ldots, x_j, x_{k+1}, \ldots$ is a subsequence of $\beta(p)$ with strictly fewer repeating
states. Repeat this step until no state repeats; the resulting sequence is (the
signature) $sign(p)$.
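
The signature construction is a small loop-removal procedure on a finite sequence of memory states. The following sketch assumes one reading of "the first state that repeats" (the earliest position whose state occurs again later); the function name `signature` and the list representation of states are illustrative choices.

```python
def signature(states):
    """Collapse a finite sequence of shared-memory states until no state repeats.

    Repeatedly: find the earliest index j whose state occurs again later,
    find the last index k holding that same state, and remove positions
    j+1..k (the repetition together with the intervening states).
    """
    seq = list(states)
    while True:
        j = next((idx for idx, x in enumerate(seq) if x in seq[idx + 1:]), None)
        if j is None:
            return seq  # no state repeats: this is the signature
        k = max(idx for idx, x in enumerate(seq) if x == seq[j])
        del seq[j + 1:k + 1]

sig = signature(["a", "b", "a", "c", "b", "d"])
```

In the example, state "a" repeats first, so the segment between its first and last occurrence is dropped, leaving a sequence in which every state is distinct. Since there are finitely many memory states, only finitely many signatures are possible, which is what the counting argument in the proof relies on.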

Since there are only finitely many states of the shared memory, there exists
a single state $s$ and a single sequence $\gamma$ such that $s = repeat(p)$ and $\gamma = sign(p)$
for infinitely many processes $p$. The solo runs of any subset $C = \{p_1, \ldots, p_c\}$ of
these processes are compatible: by scheduling appropriately, we can construct
runs in which each of the processes takes the same steps as in its solo run. By
an easy case analysis, in one such run no leader is elected, or there are many
runs in which two or more leaders are elected, a contradiction.

The (compatible) run of the processes in $C$ is constructed as follows: Let
$\gamma = y_0, y_1, \ldots$. First run these $c$ processes until each one of them is about to execute
its first write. Run process $p_1$ until it takes the last step in $\beta(p_1)$ that leaves the
memory in state $y_1$. Repeat for $p_2$ through $p_c$, then repeat this loop for each
state in $\gamma$. The resulting finite run leaves the shared memory in state $s$, and
each of the $c$ processes $p_i$ has taken the steps in $\beta(p_i)$. Since the state $s$ repeats
infinitely often in the remainder of each of the solo runs $solo(p_i)$, the run can be
extended by using a round-robin schedule to extend by some sequence of steps
of each process, always returning the memory to state $s$.

Notice that the signatures are used only in the first stage of the construction,
when the processes run until the memory state is $s$, in such a way that each
process "thinks" it ran solo to get there. After reaching (together) the recurring
state $s$, each process can be run solo until $s$ recurs. Since the same state $s$ recurs
infinitely many times for each of the processes in $C$, we can schedule each in
turn that has not terminated to take some finite, nonzero number of steps
until $s$ occurs next, then schedule the next one (i.e., round-robin scheduling).
Now everyone either runs forever or terminates as in a solo run. □



2.2 Algorithms for election and consensus for concurrency c 

Although the previous theorem shows that election for infinitely many processes
requires unbounded shared memory, when assuming participation is required,
many problems become solvable with a small number of registers. We first study
the scenario in which the concurrency is equal to participation (concurrency
$c$ and participation $[c, \infty]$). We show that election can be solved either by (1)
using an infinite number of atomic bits, or (2) using a single register of infinite
size. We also present two simple symmetric algorithms.

Theorem 2. For any finite concurrency $c$, with $[c, \infty]$-participation, there are
non-adaptive asymmetric solutions to election (and consensus) using an infinite
number of atomic bits.

Proof. We identify each process with a unique natural number. (But each natural
number is not necessarily the identity of a participating process. We always take
0 as a process id not assigned to any process.) The algorithm uses an infinite
array $b[0], b[1], \ldots$ of bits, which are initially 0. The first step of process $i$ is to
read $b[0]$. If the value of $b[0]$ is 1, it knows that a leader has already been elected and
terminates. Otherwise, process $i$ sets $b[i]$ to 1. Then, process $i$ scans the array
until it notices that $c$ bits other than $b[0]$ are set. In order not to miss
any bit that is set, it scans the array according to a diagonalization: a schedule
that visits each bit an infinite number of times. (A canonical example is to first
read $b[1]$; then $b[1], b[2]$; then $b[1], b[2], b[3]$, etc.) Once a process notices that $c$
bits are set, it sets $b[0]$ to 1. The process with the smallest id among the $c$ that
are set to 1 is the leader. By scanning the bits after reading $b[0]$ as 1, the other
processes can also learn the id of the leader. This solution for election can be
trivially modified, using one other bit, to solve consensus. □
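
The diagonal scan in this proof can be sketched on a static snapshot of the bit array. This is an illustration under simplifying assumptions, not the concurrent algorithm itself: the dictionary `bits` (index to bit, missing entries read as 0) and the function names are hypothetical, and all $c$ participants are assumed to have already set their bits.

```python
from itertools import count, islice

def diagonal_schedule():
    """Yield indices 1; 1, 2; 1, 2, 3; ... so that every index >= 1
    is visited infinitely often."""
    for n in count(1):
        for j in range(1, n + 1):
            yield j

def scan_for_leader(bits, c):
    """Scan bits other than b[0] along the diagonal schedule until c
    distinct set bits have been seen; return the smallest such index.

    Loops forever if fewer than c bits are ever set, which mirrors the
    reliance on required participation.
    """
    seen = set()
    for j in diagonal_schedule():
        if bits.get(j, 0) == 1:
            seen.add(j)
            if len(seen) == c:
                return min(seen)

first_reads = list(islice(diagonal_schedule(), 6))
leader = scan_for_leader({5: 1, 9: 1, 42: 1}, c=3)
```

Because the schedule revisits every index, a bit that is set late in a real execution would still be found on a later pass; a single left-to-right sweep would not give that guarantee.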

A symmetric election algorithm is presented in [SP89], for $n$ processes (with
concurrency level $n$), which uses only three atomic registers. Below we present
three election algorithms: a symmetric algorithm for concurrency 2 using one
register, a symmetric algorithm for concurrency $c$ using two registers, and finally,
an asymmetric algorithm for concurrency $c$ using one register. For lack of space,
we present only the code of the algorithms, and omit the detailed descriptions.

Theorem 3. For finite concurrency $c = 2$ with $[2, \infty]$-participation, there are
adaptive symmetric solutions to election and consensus using one register.

Proof. The algorithm is given in Figure 1. This election algorithm can be easily
converted into a consensus algorithm by appending the input value (0 or 1) to
each process id. The input value of the leader is the consensus value. □

Process i's program
Shared:
    (Leader, Marked): (Process id, boolean), initially (0, 0)
Local:
    local_leader: Process id

1  if Marked = 0 then
2      (Leader, Marked) := (i, 0)
3      await (Leader ≠ i) ∨ (Marked = 1)
4      local_leader := Leader
5      (Leader, Marked) := (local_leader, 1)
6  fi
7  return(Leader)

Fig. 1. Symmetric election for concurrency 2
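
The behaviour of the program in Figure 1 can be traced with a small deterministic simulation. This is a sketch under stated assumptions: each process is modelled as a Python generator that yields after every access to the single shared register, and a hypothetical round-robin scheduler (`run_round_robin`) interleaves them; the process ids 7 and 11 are arbitrary.

```python
def election2(reg, i):
    """Process i's program from Fig. 1; reg is a one-element list holding
    the register (Leader, Marked). Each yield marks one atomic access."""
    _, marked = reg[0]                  # line 1: read Marked
    yield
    if marked == 0:
        reg[0] = (i, 0)                 # line 2: write own id, unmarked
        yield
        while True:                     # line 3: await
            leader, marked = reg[0]
            yield
            if leader != i or marked == 1:
                break
        local_leader = reg[0][0]        # line 4: read Leader
        yield
        reg[0] = (local_leader, 1)      # line 5: mark the leader
        yield
    return reg[0][0]                    # line 7: return Leader

def run_round_robin(procs):
    """Advance each live generator by one step in turn; collect return values."""
    results, live = {}, dict(procs)
    while live:
        for pid in list(live):
            try:
                next(live[pid])
            except StopIteration as stop:
                results[pid] = stop.value
                del live[pid]
    return results

reg = [(0, 0)]
outcome = run_round_robin({i: election2(reg, i) for i in (7, 11)})
```

Under this particular schedule the last process to write Leader (id 11) is elected and both processes return it. Ids are only compared for equality, as the symmetric model requires.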

The next algorithm, for concurrency $c > 2$, is similar to the previous, in that
the last participant to write the Leader field is then Marked as the leader. A
second register, Union, is used by the first $c$ participants to ensure that they
have all finished writing their id into Leader. (Note that in doing so, they learn
each other's id's.) Because $c$ is also the number of required participants, each of
the first $c$ participants can spin on the Union register until the number of id's
in it is $c$.

Theorem 4. For any finite concurrency $c$ with $[c, \infty]$-participation, there are
adaptive symmetric solutions to election and consensus using two registers.

Proof. The algorithm is given in Figure 2. As above, this election algorithm can
be easily converted into a consensus algorithm. □

Finally, we present an asymmetric solution to election (and consensus)
using a single atomic register. Asymmetric algorithms allow comparisons between
process id's (we assume a total order); however, since the identities of the
participating processes are not initially known to all processes, the trivial solution
where all processes choose process 1 as a leader is excluded.

Theorem 5. For any finite concurrency $c$ and $[c, \infty]$-participation there is an
adaptive asymmetric solution to election and consensus using one register.
Shared:
    (Leader, Marked): (Process id, boolean), initially (0, 0)
    Union: set of at most c process ids, initially ∅
Local:
    local_leader: Process id
    local_union1: set of at most c process ids
    local_union2: set of at most c process ids, initially {i}

1   if Marked = 0 then (Leader, Marked) := (i, 0) fi
2   local_union1 := Union
3   while |local_union1| < c do
4       if ¬(local_union2 ⊆ local_union1) then
5           Union := local_union1 ∪ local_union2
6       fi
7       local_union2 := local_union1 ∪ local_union2
8       local_union1 := Union
9   od
10  local_leader := Leader
11  (Leader, Marked) := (local_leader, 1)
12  return(Leader)

Fig. 2. Symmetric election for concurrency c
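
The two-register program of Figure 2 can be traced in the same way as a deterministic simulation. Again this is only a sketch: processes are Python generators that yield after each shared access, the scheduler `run_round_robin` and the dictionary `shared` are hypothetical scaffolding, and the ids 1, 2, 3 with $c = 3$ are arbitrary.

```python
def election_c(shared, i, c):
    """Process i's program from Fig. 2. shared['reg'] is (Leader, Marked),
    shared['union'] is the Union register. Each yield is one atomic access."""
    marked = shared['reg'][1]                   # line 1: read Marked
    yield
    if marked == 0:
        shared['reg'] = (i, 0)                  # line 1: write own id
        yield
    union2 = frozenset({i})
    union1 = shared['union']                    # line 2
    yield
    while len(union1) < c:                      # line 3
        if not union2 <= union1:                # line 4
            shared['union'] = union1 | union2   # line 5
            yield
        union2 = union1 | union2                # line 7
        union1 = shared['union']                # line 8
        yield
    local_leader = shared['reg'][0]             # line 10
    yield
    shared['reg'] = (local_leader, 1)           # line 11
    yield
    return shared['reg'][0]                     # line 12

def run_round_robin(procs):
    """Advance each live generator by one step in turn; collect return values."""
    results, live = {}, dict(procs)
    while live:
        for pid in list(live):
            try:
                next(live[pid])
            except StopIteration as stop:
                results[pid] = stop.value
                del live[pid]
    return results

shared = {'reg': (0, 0), 'union': frozenset()}
outcome = run_round_robin({i: election_c(shared, i, 3) for i in (1, 2, 3)})
```

Writes to Union may overwrite one another, which is why each process keeps re-reading the register and re-adding whatever it has learned until all $c$ ids appear together. Under this schedule process 3 writes Leader last and is elected by all three processes.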



Proof. The algorithm in Figure 3 is an adaptation of the two register algorithm
for the symmetric case (Theorem 4), where only a single register is used. As
before, each of the first $c$ processes registers itself in the register and repeatedly
updates Union with any new information it has obtained (as done in the repeat
loop of the symmetric algorithm). The elected process is the process with the
minimum id among the first $c$ processes to participate in the algorithm. □

3 Atomic registers: participation is not required 

3.1 Consensus, election and mutual exclusion for concurrency c 

We next consider the number and size of registers required to solve consensus, 
election and mutual exclusion in a model in which participation is not required 
(i.e., [1, ∞]-participation).

Theorem 6. For concurrency level c, the number of atomic registers that are:

(1) necessary and sufficient for solving deadlock-free mutual exclusion is c;

(2) necessary and sufficient for solving election is log c + 1;

(3) sufficient for solving consensus is log c + 1.

The proof, which consists of minor modifications of known results, is omitted.
Theorem 6 implies that when the concurrency level is not finite, an infinite 
number of registers are necessary for solving the election and mutual exclusion 
problems. But are an infinite number of registers sufficient? (We observe, based 
on a result from [YA94] , that if we restrict the number of processes that can try 
to write the same register at the same time, the answer is no.) 
Process i's program

Shared: (Union, Sig, Ack): Union a set of at most c process id’s,
        0 ≤ Sig ≤ c, Ack ∈ {0, 1}, initially (∅, 0, 0)
Local:  local_union1: set of at most c process id’s
        local_union2: set of at most c process id’s, initially {i}
        myrank, local_sig1, local_sig2: integers between 0 and c, initially 0
        local_ack1: Boolean, initially 0
        leader: process id

(local_union1, local_sig1, local_ack1) := (Union, Sig, Ack)
while |local_union1| < c do
    if ¬(local_union2 ⊆ local_union1) then
        (Union, Sig, Ack) := (local_union1 ∪ local_union2, 0, 0)
    fi
    local_union2 := local_union1 ∪ local_union2
    (local_union1, local_sig1, local_ack1) := (Union, Sig, Ack)
od
local_union2 := local_union1
leader := min{q : q ∈ local_union1}
if i ∉ local_union1 then
    return(leader)
elseif i ≠ leader then
    myrank := |{h ∈ local_union1 : h < i}|
    while local_sig1 < c do
        if (|local_union1| < c) then (Union, Sig, Ack) := (local_union2, 0, 0)
        elseif (local_sig1 = myrank) ∧ (local_ack1 = 1) then
            (Union, Sig, Ack) := (local_union2, myrank, 0)
        fi
        (local_union1, local_sig1, local_ack1) := (Union, Sig, Ack)
    od
    return(leader)
else /* |local_union2| = c, i = leader */
    while local_sig1 < c do
        if (|local_union1| < c) ∨ (local_sig1 < local_sig2) then
            (Union, Sig, Ack) := (local_union2, local_sig2, local_ack2)
        elseif local_ack1 = 0 then /* Signal acknowledged */
            local_sig2 := local_sig2 + 1
            (Union, Sig, Ack) := (local_union2, local_sig2, 1) /* Send signal. */
        fi
        (local_union1, local_sig1, local_ack1) := (Union, Sig, Ack)
    od
    return(i)
fi

Fig. 3. (Asymmetric) election for concurrency c
3.2 Relating participation and concurrency

The following theorem unifies the results of Theorems 5 (participation is re- 
quired) and 6 (participation is not required), demonstrating a relation between 
participation and concurrency in this case. 

Theorem 7. For concurrency c with [ℓ, ∞]-participation where ℓ < c,
log(c − ℓ + 1) + 2 registers are sufficient for solving election.

Proof. First, we use the single-register algorithm (Theorem 5) as a filter. Here,
instead of choosing a single leader, up to c − ℓ + 1 processes may be elected. These
processes continue to the next level. To implement this, we slightly modify the
single-register algorithm. A process needs to wait only until it notices that ℓ
(instead of c) processes are participating, and if it is the biggest among them it
continues to the next level. Thus, at most c − ℓ + 1 (and at least one) processes
continue to the next level. In that level, they compete using the algorithm of
Theorem 6 (no change is needed in that algorithm), until one of them is elected.
This level requires log(c − ℓ + 1) + 1 registers. □
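As a rough sanity check of this count, the following sketch adds up the registers used by the two levels. It assumes, beyond what the text states, that log means the base-2 logarithm rounded up to an integer; the function name registers_for_election and the parameter name ell (the number of required participants) are ours.

```python
def registers_for_election(c: int, ell: int) -> int:
    """Register count of the two-level construction, for ell < c.

    Assumption (not stated in the text): log is base 2, rounded up.
    """
    if not 1 <= ell < c:
        raise ValueError("need 1 <= ell < c")
    survivors = c - ell + 1                    # at most this many pass the filter
    ceil_log2 = (survivors - 1).bit_length()   # ceil(log2(survivors)) for survivors >= 1
    filter_registers = 1                       # single-register filter (first level)
    second_level = ceil_log2 + 1               # election among the survivors
    return filter_registers + second_level


# Example: concurrency 8 with 5 required participants leaves at most 4 survivors.
print(registers_for_election(8, 5))
```

Under these assumptions the count shrinks as the number of required participants approaches c, which is the relation between participation and concurrency that the theorem expresses.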



3.3 Starvation-free algorithms for unbounded concurrency 

Theorem 6 leads naturally to the question of whether an infinite number of regis- 
ters suffice for solving mutual exclusion with unbounded concurrency, when par- 
ticipation is not required. We present two algorithms that answer this question 
affirmatively. The first is an adaptive and symmetric algorithm using infinitely 
many infinite-sized registers. The second is neither adaptive nor symmetric, but 
uses only (infinitely many) bits. 

Theorem 8. There is an adaptive symmetric solution to election, consensus and
starvation-free mutual exclusion for unbounded concurrency using an infinite
number of registers.

Proof. These problems can all be solved by simple adaptations to the deadlock- 
free mutual exclusion algorithm presented in Figure 4. This algorithm has three 
interesting properties: it works assuming an unbounded concurrency level; it is
adaptive, that is, its time complexity is a function of the actual number of
contending processes; and it is symmetric. Except for the non-adaptive algorithm
in the next subsection, we know of no mutual exclusion algorithm (using atomic registers)
satisfying the first property. In this algorithm, the processes compete in levels, 
each of which is used to eliminate at least one competing process, until only one 
process remains. The winner enters its critical section, and in its exit code it 
publishes the index to the next empty level, so that each process can join the 
competition starting from that level. 

The adaptive deadlock-free algorithm above is easily modified using standard 
“helping” techniques to satisfy starvation freedom. Because the number of pro- 
cesses is infinite, a global diagonalization is necessary instead of a round-robin 
schedule: a process helps others in the order given by an enumeration in which 
Process i's program
Shared:
    next: integer, initially 0
    x[0..∞]: array of integers (the initial values are immaterial)
    b[0..∞], y[0..∞], z[0..∞]: array of boolean, initially all 0
Local:
    level: integer, initially 0
    win: boolean, initially 0

start: level := next
repeat
    x[level] := i
    if y[level] then b[level] := 1
                     await level < next
                     goto start fi
    y[level] := 1
    if x[level] ≠ i then await (b[level] = 1) ∨ (z[level] = 1)
                         if z[level] = 1 then await level < next
                                              goto start
                         else level := level + 1 fi
    else z[level] := 1
         if b[level] = 0 then win := 1
         else level := level + 1 fi fi
until win = 1
critical section
next := level + 1

Fig. 4. Adaptive deadlock-free mutual exclusion for unbounded concurrency



every process id appears infinitely often. That is, processes set flags when they 
leave the remainder section, and before leaving the critical section, a process 
examines the flag of the next process in the diagonalization and grants the
critical section if it determines the associated process is waiting. (Global variables
recording progress in this diagonalization are maintained via the mutual
exclusion of the critical section.) Starvation-freedom follows if each process appears
infinitely often in the diagonalization. Even simpler, standard modifications con- 
vert the above algorithm to solve leader election or consensus. □ 
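The deadlock-free algorithm of Figure 4 can be exercised with a small simulation. The sketch below is ours and only illustrative: each process is a Python generator, every `yield` ends one atomic access to shared memory, and a seeded random scheduler (an assumption standing in for an arbitrary interleaving) picks the next process to step. The counters inside, max_inside and entered are bookkeeping added for the check and are not part of the algorithm. Each process enters its critical section once and then stops.

```python
import random
from collections import defaultdict


def process(i, sh):
    """Process i's program from Fig. 4; each `yield` ends one atomic step."""
    x, y, z, b = sh["x"], sh["y"], sh["z"], sh["b"]
    while True:                                    # label "start"
        level = sh["next"]
        yield
        win = False
        while True:                                # repeat ... until win = 1
            x[level] = i
            yield
            y_seen = y[level]
            yield
            if y_seen:                             # level already taken: flag it and wait
                b[level] = 1
                yield
                break
            y[level] = 1
            yield
            x_seen = x[level]
            yield
            if x_seen != i:
                while not (b[level] or z[level]):  # await b = 1 or z = 1
                    yield
                yield
                z_seen = z[level]
                yield
                if z_seen:
                    break
                level += 1
            else:
                z[level] = 1
                yield
                b_seen = b[level]
                yield
                if b_seen == 0:
                    win = True
                    break
                level += 1
        if win:
            break
        while not level < sh["next"]:              # await level < next, then goto start
            yield
        yield
    sh["inside"] += 1                              # critical section (bookkeeping only)
    sh["max_inside"] = max(sh["max_inside"], sh["inside"])
    sh["entered"].append(i)
    yield
    sh["inside"] -= 1
    sh["next"] = level + 1                         # exit code


def run(ids, seed, max_steps=200_000):
    """Interleave the processes at random; return the shared state at the end."""
    rng = random.Random(seed)
    sh = {"next": 0, "inside": 0, "max_inside": 0, "entered": [],
          "x": defaultdict(int), "y": defaultdict(int),
          "z": defaultdict(int), "b": defaultdict(int)}
    running = {i: process(i, sh) for i in ids}
    steps = 0
    while running:
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted")
        steps += 1
        i = rng.choice(sorted(running))
        try:
            next(running[i])
        except StopIteration:
            del running[i]
    return sh


state = run([1, 2, 3, 4, 5], seed=0)
print(state["entered"], state["max_inside"])
```

A run of this kind does not prove anything, but it is a cheap way to see the level-by-level elimination and the role of next in restarting the losers.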

The question of designing an adaptive mutual exclusion algorithm was first
raised in [MT93], where a solution was given for a given working system, which is
useful provided process creations and deletions are rare (the term contention
sensitive was suggested, but the term adaptive became commonly used). In [CS94],
the only previously known adaptive mutual exclusion algorithm was presented,
in a model where it is assumed that the number of processes (and hence
concurrency) is finite. The algorithm exploits this assumption to work in bounded space,
and it does not work assuming unbounded concurrency. In [Lam87], a
fast algorithm is presented which provides fast access in the absence of
contention. However, in the presence of any contention, the winning process may
have to check the status of all other n processes (i.e., access n different shared
registers) before it is allowed to enter its critical section. A symmetric
(non-adaptive) mutual exclusion algorithm for n processes is presented in [SP89].

Theorem 9. There is a non-adaptive asymmetric solution to election, consensus
and starvation-free mutual exclusion for infinite concurrency using an infinite
number of bits.

Process i's program
Shared:
    RaceOwner[1..∞], RaceOther[1..∞], Win[1..∞], Lose[1..∞]: boolean, initially 0
Local:
    index: integer, initially 1

RaceOwner[i] := 1
if RaceOther[i] = 0 then Win[i] := 1 else Lose[i] := 1 fi
repeat forever
    RaceOther[index] := 1
    if RaceOwner[index] = 1 then
        await (Win[index] = 1 or Lose[index] = 1)
    fi
    if Win[index] = 1 then return(index) fi
    index := index + 1
end repeat

Fig. 5. (Non-adaptive) leader election for infinite concurrency using bits



Proof. Figure 5 presents a simple algorithm for election. Modification to solve
consensus is trivial; starvation-free mutual exclusion can be achieved using
techniques similar to those in the previous algorithm. □
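To make the election algorithm of Figure 5 concrete, here is a small simulation sketch. It is ours, not part of the proof: the infinite bit arrays are modelled by dictionaries that default to 0, each `yield` ends one atomic step, and a seeded random scheduler (an assumption) chooses the interleaving. The helper names elect and run are hypothetical.

```python
import random
from collections import defaultdict


def elect(i, mem):
    """Process i's program from Fig. 5; each `yield` ends one atomic step."""
    mem["RaceOwner"][i] = 1
    yield
    other = mem["RaceOther"][i]
    yield
    if other == 0:
        mem["Win"][i] = 1           # nobody scanned index i before its owner arrived
    else:
        mem["Lose"][i] = 1
    yield
    index = 1
    while True:                     # scan indices 1, 2, 3, ... for the first winner
        mem["RaceOther"][index] = 1
        yield
        owner = mem["RaceOwner"][index]
        yield
        if owner == 1:              # the owner is racing: wait for its outcome
            while not (mem["Win"][index] == 1 or mem["Lose"][index] == 1):
                yield
        if mem["Win"][index] == 1:
            return index
        yield
        index += 1


def run(ids, seed):
    """Interleave the processes at random; map each process id to the leader it returns."""
    rng = random.Random(seed)
    mem = {name: defaultdict(int)
           for name in ("RaceOwner", "RaceOther", "Win", "Lose")}
    running = {i: elect(i, mem) for i in ids}
    result = {}
    while running:
        i = rng.choice(sorted(running))
        try:
            next(running[i])
        except StopIteration as done:
            result[i] = done.value
            del running[i]
    return result


print(run([3, 7, 2, 11], seed=1))
```

In every run all participants return the same index, namely the smallest index whose owner won its race; which process that is depends on the interleaving.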

4 Test&set bits 

A test&set bit is an object that may take the value 0 or 1, and is initially 
set to 1. It supports two operations: (1) a reset operation: write 0, and (2) a 
test&set operation: atomically assign 1 and return the old value. We first make
the following observation:

Theorem 10. An infinite number of atomic registers are necessary for
implementing a test&set bit when concurrency is bounded and sufficient when
concurrency is infinite.

Theorem 11. For solving starvation-free mutual exclusion, an infinite number
of atomic bits and test&set bits are necessary and sufficient when the concurrency
level is bounded.

Proof. In [Pet94], it is proved that n atomic registers and test&set bits are nec- 
essary for solving the starvation-free mutual exclusion problem for n processes 
(with concurrency level n). This implies the necessary condition above. The
sufficient condition follows immediately from Theorem 9. □
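The test&set bit defined at the start of this section can be sketched as a small class. This is only an illustration of the two operations: the class name is ours, the initial value 1 follows the definition given in the text, and atomicity is obtained here with a lock, which is an implementation convenience and not a claim about how such a bit is built from registers.

```python
import threading


class TestAndSetBit:
    """A bit with reset and test&set; the lock stands in for hardware atomicity."""

    def __init__(self):
        self._value = 1                 # initial value as stated in the text
        self._lock = threading.Lock()

    def reset(self):
        """Write 0."""
        with self._lock:
            self._value = 0

    def test_and_set(self):
        """Atomically assign 1 and return the old value."""
        with self._lock:
            old, self._value = self._value, 1
            return old


bit = TestAndSetBit()
bit.reset()
first = bit.test_and_set()    # old value after a reset
second = bit.test_and_set()   # the bit is now set
print(first, second)
```

Only the caller that gets 0 back has changed the bit, which is the property that mutual exclusion and naming algorithms rely on.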
4.1 Naming

The following theorem demonstrates that in certain cases a problem is solvable
assuming bounded concurrency, but is not solvable assuming unbounded
concurrency. The wait-free naming problem is to assign unique names to initially
identical processes. After acquiring a name the process may later release it. A
process terminates only after releasing the acquired name. A solution to the
problem is required to be wait-free, that is, it should guarantee that every
participating process will always be able to get a unique name in a finite number of
steps regardless of the behavior of other processes. (Proof is omitted.)

Theorem 12. (1) For bounded concurrency, an infinite number of T&S bits
are necessary and sufficient for solving wait-free naming. (2) For unbounded
concurrency, there is no solution to wait-free naming, even when using an infinite
number of T&S bits.

5 Stronger Objects 

5.1 RMW bits 

A read-modify-write object supports a read-modify-write operation, in which it
is possible to atomically read a value of a shared register and, based on the value
read, compute some new value and assign it back to the register. When assuming
a fault-free model with required participation, many problems become solvable
with small constant space. The proofs of the next two theorems, which consist
of minor modifications of known results, are omitted.

Theorem 13. There is a solution to consensus with required participation, using
f RMW bits, with infinite concurrency, even assuming at most one process may
fail (by crashing).

We remark that the proof of the theorem above holds only under the assumption 
that it is known that there exist processes with ids 1 and 2. In the algorithm
in [LA87] (which we modify) only one of these processes may be elected. 

Theorem 14. (1) When only a finite number of RMW registers are used for
solving starvation-free mutual exclusion with bounded concurrency, one of them
must be of unbounded size. (2) One unbounded size RMW register is sufficient
for solving first-in, first-out adaptive symmetric mutual exclusion when the
concurrency level is unbounded.
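The read-modify-write operation described at the start of this subsection can be illustrated as follows. The sketch is ours and is not the omitted construction: the class RMWRegister and the helper take_ticket are hypothetical names, the lock merely stands in for the atomicity of the operation, and the ticket counter is just one simple use of an unbounded register, of the kind a first-in, first-out scheme might build on.

```python
import threading


class RMWRegister:
    """A register whose only operation atomically applies a function to its value."""

    def __init__(self, initial):
        self._value = initial
        self._lock = threading.Lock()

    def read_modify_write(self, update):
        """Atomically replace the value v by update(v) and return v."""
        with self._lock:
            old = self._value
            self._value = update(old)
            return old


def take_ticket(register):
    """Fetch-and-increment expressed as a read-modify-write."""
    return register.read_modify_write(lambda v: v + 1)


counter = RMWRegister(0)
tickets = []
tickets_lock = threading.Lock()


def worker():
    for _ in range(250):
        t = take_ticket(counter)
        with tickets_lock:
            tickets.append(t)


threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(len(tickets), len(set(tickets)))
```

Because every ticket is handed out exactly once, the tickets give a total order on the requests; Python integers are unbounded, which matches the unbounded register size the theorem calls for.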

5.2 Semaphores 

Given the results so far about starvation-free mutual exclusion, it is natural to
ask whether it can be solved with bounded space. The answer, as presented in
[FP87], is that using weak semaphores it can be solved with small constant space
for unbounded concurrency, but not with infinite concurrency level.
The result below refers to weak semaphores, in which a process that executes
a V operation will not be the one to complete the next P operation on that
semaphore, if another process has been blocked at that semaphore. Instead, one
of the blocked processes is allowed to pass the semaphore.

Theorem 15 (Friedberg and Peterson [FP87]). (1) There is an adaptive
symmetric solution to starvation-free mutual exclusion using two atomic bits
and two weak semaphores, when the concurrency is unbounded. (2) There is no
solution to the starvation-free mutual exclusion problem using any finite number
of atomic registers and weak semaphores, when the concurrency is infinite.
Abstract. Conventional mechanisms for electronic commerce provide
strong means for securing transfer of funds, and for ensuring such things
as authenticity and non-repudiation. But they generally do not attempt
to regulate the activities of the participants in an e-commerce
transaction, treating them, implicitly, as autonomous agents. This is adequate
for most cases of client-to-vendor commerce, but is quite unsatisfactory
for inter-enterprise electronic commerce. The participants in this kind
of e-commerce are not autonomous agents, since their commercial
activities are subject to the business rules of their respective enterprises, and
to the preexisting agreements and contracts between the enterprises
involved. These policies are likely to be independently developed, and may
be quite heterogeneous. Yet, they have to interoperate, and be brought
to bear in regulating each e-commerce transaction. This paper presents
a mechanism that allows such interoperation between policies, and thus
provides for inter-enterprise electronic commerce.



1 Introduction 

Commercial activities need to be regulated in order to enhance the confidence of 
people that partake in them, and in order to ensure compliance with the various 
rules and regulations that govern these activities. Conventional mechanisms for 
electronic commerce provide strong means for securing transfer of funds, and 
for ensuring such things as authenticity and non-repudiation. But they generally 
do not attempt to regulate the activities of the participants in an e-commerce 
transaction, treating them, implicitly, as autonomous agents. 

This is adequate for most cases of client-to-vendor commerce, but is quite 
unsatisfactory for the potentially more important inter-enterprise (also called
business-to-business or B2B) electronic commerce¹. The participants in this kind
of e-commerce are not autonomous agents, since their commercial activities are 
subject to the business rules of their respective enterprises, and to the preexisting 
agreements and contracts between the enterprises involved. The nature of this 
situation can be illustrated by the following example. 

Work supported in part by DIMACS under contract STC-91-19999 and NSF grants 
No. CCR-9626577 and No. CCR-9710575 

¹ At present, B2B accounts for over 80% of all e-commerce, amounting to $150 billion
in 1999 (cf. [5]). It is estimated that by 2003 this figure could reach $3 trillion. 

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 179-193, 2000. 

(c) Springer-Verlag Berlin Heidelberg 2000
Consider a purchase transaction between an agent x1 of an enterprise E1
(the client in this case), and an agent x2 of an enterprise E2 (the vendor). Such
a transaction may be subject to the following set of policies:

1. A policy P1 that governs the ability of agents of enterprise E1 to engage
in electronic commerce. For example, P1 may provide some of its agents
with budgets, allowing each of them to issue purchase orders only within the
budget assigned to it².

2. A policy P2 that governs the response of agents of enterprise E2 to purchase
orders received from outside. For example, P2 may require that all responses
to purchase orders should be monitored (for the sake of internal control, say).

3. A policy P1,2 that governs the interaction between these two enterprises,
reflecting some prior contract between them; we will call this an “interaction
policy”. For example, P1,2 may reflect a blanket agreement between these two
enterprises that calls for agents in E2 to honor purchase orders from agents
in E1, for up to a certain cumulative value, to be called the “blanket” for
this pair of enterprises.

Note that policies P1 and P2 are formulated separately, without any knowledge
of each other, and they are likely to evolve independently. Furthermore, E1 may
have business relations with other enterprises E3, ..., under a set of different
interaction policies P1,3, ...

The implementation of such policies is problematic due to their diversity,
their interconnectivity and the large number of participants involved. We will
elaborate now on these factors, and draw conclusions that are used as principles on
which this work is based.

First, e-commerce participants have little reason to trust each other to
observe any given policy, unless there is some enforcement mechanism that
compels them all to do so. The currently prevailing method for establishing e-commerce
policies is to build an interface that implements a desired policy, and distribute
this interface among all who may need to operate under it. Unfortunately, such
a “manual” implementation is both unwieldy and unsafe. It is unwieldy in that
it is time consuming and expensive to carry out, and because the policy
being implemented by a given set of interfaces is obscure, being embedded into
the code of the interface. A manually implemented policy is unsafe because it
can be circumvented by any participant in a given commercial transaction, by
modifying his interface for the policy. These observations suggest the following
principle:

Principle 1 E-commerce policies should be made explicit, and be enforced by 
means of a generic mechanism that can implement a wide range of policies in a 
uniform manner. 

Second, e-commerce policies are usually enforced by a central authority (see, for
example, NetBill [4], SET [9], EDI [13]) which mediates between interlocutors.

² An agent of an enterprise may be a person or a program.

Establishing Business Rules for Inter-Enterprise Electronic Commerce

For example, in the case of the P1,2 policy above, one can have the blankets
maintained by an authority trusted by both enterprises, which will ensure that
neither party violates this policy.

However, such a centralized enforcement mechanism is not scalable. When the
number of participants grows, the centralized authority becomes a bottleneck,
and a dangerous single point of failure. A centralized enforcement mechanism
is thus unsuitable for B2B e-commerce because of the huge number of
participants involved: large companies may have as many as tens of thousands of
supplier-enterprises (cf. [6]). The need for scalability leads to the following
principle:

Principle 2 The enforcement mechanism of e-commerce policies needs to be
decentralized.

Finally, a single B2B transaction is subject to a conjunction of several distinct
and heterogeneous policies. The current method for establishing a set of policies
is to combine them into a single, global super-policy. While an attractive
approach for other domains³, combination of policies is not well suited for
inter-enterprise commerce because it does not provide for the privacy of the interacting
enterprises, nor for the evolution of their policies. We will now briefly elaborate
on these issues.

The creation of a super-policy requires knowledge of the text of sub-policies.
But divulging to a third party the internal business rules of an enterprise is
not common practice in today’s commerce. Even if companies would agree to
expose their policies, it would still be very problematic to construct and maintain
the super-policy. This is because it is reasonable to assume that the business rules
of a particular enterprise, or its contracts with its suppliers (the sub-policies
in a B2B scenario), change in time. Each modification at the sub-policy level
triggers, in turn, the modification of all super-policies it is part of, thus leading
to a maintenance nightmare. We believe, therefore, that it is important for the
following principle to be satisfied:

Principle 3 Inter-operation between e-commerce policies should maintain their
privacy, autonomy and mutual transparency.

We have shown in [10] how fairly sophisticated contracts between autonomous
clients and vendors can be formulated using what we call Law-Governed
Interaction (LGI). The model was limited, however, in the sense that policies were
viewed as isolated entities. In this paper we will describe how LGI has been
extended, in accordance with the principles mentioned above, to support policy
inter-operation.

The rest of the paper is organized as follows. We start, in Section 2, with a
brief description of the concept of LGI, on which this work is based; in Section 3
we present our concept of policy-interoperability. Details of a secure
implementation of interoperability under LGI are provided in Section 4. Section 5 discusses
some related work, and we conclude in Section 6.

³ Like, for example, federation of databases, for which it was originally devised.
2 Law-Governed Interaction (LGI): an Overview

Broadly speaking, LGI is a mode of interaction that allows a heterogeneous
group of distributed agents to interact with each other, with confidence that an
explicitly specified set of rules of engagement, called the law of the group, is
strictly observed by each of its members. Here we provide a brief overview of
LGI; for a more detailed discussion see [10, 11].

The central concept of LGI is that of a policy P, defined as a four-tuple:

⟨M, G, CS, L⟩

where M is the set of messages regulated by this policy; G is an open and
heterogeneous group of agents that exchange messages belonging to M; CS is a
mutable set {CSx | x ∈ G} of what we call control states, one per member of
group G; and L is an enforced set of “rules of engagement” that regulates the
exchange of messages between members of G. We will now give a brief description
of the basic components of a policy.

The Law: The law is defined over certain types of events occurring at members of
G, mandating the effect that any such event should have; this mandate is called
the ruling of the law for a given event. The events thus subject to the law of a
group under LGI are called regulated events; they include (but are not limited
to) the sending and arrival of P-messages.

The Group: We refer to members of G as agents, by which we mean autonomous
actors that can interact with each other, and with their environment. Such an
agent might be an encapsulated software entity, with its own state and thread
of control, or it might be a human that interacts with the system via some
interface. (Given popular usage of the term “agent”, it is important to point out
that this term does not imply here either “intelligence” or mobility, although
neither of these is ruled out.) Nothing is assumed here about the structure and
behavior of the members of a given L-group, which are viewed simply as sources
of messages, and targets for them.

The Control State: The control-state CSx of a given agent x is the bag of
attributes associated with this agent (represented here as Prolog-like terms). These
attributes are used to structure the group G, and provide state information about
individual agents, allowing the law L to make distinctions between different
members of the group. The control-state CSx can be acted on by the primitive
operations, which are described below, subject to law L.

Regulated Events: The events that are subject to the law of a policy are called
regulated events. Each of these events occurs at a certain agent, called the home
of the event⁴. The following are two of these event-types:

1. sent(x,m,y): occurs when agent x sends a P-message m addressed to y.
The sender x is considered the home of this event.

2. arrived(x,m,y): occurs when a P-message m sent by x arrives at y. The
receiver y is considered the home of this event.

⁴ Strictly speaking, events occur at the controller assigned to the home-agent.



Primitive Operations: The operations that can be included in the ruling of the 
law for a given regulated event e, to be carried out at the home of this event, 
are called primitive operations. Primitive operations currently supported by LGI 
include operations for testing the control-state of an agent and for its update, 
operations on messages, and some others. A sample of primitive operations is 
presented in Figure 1. 



Operations on the control-state

  t@CS            returns true if term t is present in the control state, and fails otherwise
  +t              adds term t to the control state
  -t              removes term t from the control state
  t1←t2           replaces term t1 with term t2
  incr(t(v),d)    increments the value of the parameter v of term t by quantity d
  dcr(t(v),d)     decrements the value of the parameter v of term t by quantity d

Operations on messages

  forward(x,m,y)  sends message m from x to y; triggers at y an arrived(x,m,y) event
  deliver(x,m,y)  delivers to agent y message m (sent by x)

Fig. 1. Some primitive operations
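The control-state operations listed in Figure 1 can be pictured with a small Python sketch. Everything here is our own illustration and not the Moses implementation: the class ControlState, its method names, and the encoding of a term t(v) as a pair (name, value) are assumptions, as is the choice to treat removal or replacement of an absent term as a no-op.

```python
class ControlState:
    """A bag of terms; a term t(v) is encoded as the pair (name, value)."""

    def __init__(self, terms=()):
        self.terms = list(terms)

    def has(self, t):                    # presence test for term t
        return t in self.terms

    def add(self, t):                    # +t
        self.terms.append(t)

    def remove(self, t):                 # -t (assumed to do nothing if t is absent)
        if t in self.terms:
            self.terms.remove(t)

    def replace(self, t1, t2):           # replace t1 by t2 (no-op if t1 is absent)
        if t1 in self.terms:
            self.terms[self.terms.index(t1)] = t2

    def incr(self, name, d):             # incr(t(v),d): add d to the parameter of t
        for k, (n, v) in enumerate(self.terms):
            if n == name:
                self.terms[k] = (n, v + d)
                return

    def dcr(self, name, d):              # dcr(t(v),d): subtract d from the parameter of t
        self.incr(name, -d)


cs = ControlState([("budget", 100)])
cs.add(("role", "buyer"))
cs.dcr("budget", 30)                     # e.g. a purchase order of value 30 was issued
print(cs.terms)
```

A law would invoke such operations as part of its ruling for an event, for example reducing a budget term when a purchase order is sent.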



The Law-Enforcement Mechanism: Law is enforced by a set of trusted entities
called controllers that mediate the exchange of P-messages between members
of group G. For every active member x in G, there is a controller Cx logically
placed between x and the communications medium, and all these controllers carry
the same law L. This allows the controller Cx assigned to x to compute the ruling
of L for every event at x, and to carry out this ruling locally.

Controllers are generic, and can interpret and enforce any well formed law.
A controller operates as an independent process, and it may be placed on the 
same machine as its client, or on some other machine, anywhere in the network. 
Under Moses (our current implementation of LGI) each controller can serve 
several agents, operating under possibly different laws. 

3 Interoperability Between Policies 

In this section we introduce an extension to the LGI framework that provides for
the interoperability of different and otherwise unrelated policies. This section is
organized as follows: in Section 3.1 we present our concept of interoperability.
In Section 3.2 we describe an extension of LGI that supports this concept; the
extended LGI is used in Section 3.3 to implement a slightly refined version of the
motivating example presented in Section 1. We conclude this section by showing
how privacy, autonomy and transparency of interoperating policies are achieved
in this framework.

3.1 A Concept of Interoperability 

By “interoperability” we mean here the ability of an agent x/P (short for “an
agent x operating under policy P”) to exchange messages with y/Q, where P and
Q are different policies⁵, such that the following properties are satisfied:

consensus: An exchange between a pair of policies is possible only if it is
authorized by both.

autonomy: The effect that an exchange initiated by x/P may have on the
structure and behavior of y/Q is subject to policy Q alone.

transparency: Interoperating parties need not be aware of the details of
each other’s policy.

To provide for such an inter-policy exchange we introduce into LGI a new primitive
operation, export, and a new event, imported, as follows:

- Operation export(x/P,m,y/Q), invoked by agent x under policy P, initiates
an exchange between x and agent y operating under policy Q. When the
message carrying this exchange arrives at y, it invokes an imported
event at y under Q.

- Event imported(x/P,m,y/Q) occurs when a message m exported by x/P
arrives at y/Q.

We will return to the above properties in Section 3.4 and show how they are 
brought to bear under LGI. 

3.2 Support for Interoperability under LGI 

A policy P is maintained by a server that provides persistent storage for the law
L of this policy, and the control-states of its members. This server is called the
secretary of P, to be denoted by S_P. In the basic LGI mechanism, the secretary
serves as a name server for policy members. In the extended model it acts also
as a name server for the policies which inter-operate with P. In order to do so,
S_P maintains a list of policies that members of P are allowed to export to,
and respectively import from (subject, of course, to L_P). For each such policy P',
S_P records, among other information, the address of S_P'.

For an agent x to be able to exchange P-messages under a policy P, it needs to
engage in a connection protocol with the secretary. The purpose of the protocol
is to assign x to a controller Cx which is fed the law of P and the control state
of x (for a detailed presentation of this protocol the reader is referred to [11]).

⁵ It is interesting to note that x and y may actually be the same agent.

Fig. 2. Policy interoperation


To see how an export operation is carried out, consider an agent x operating
under policy P, which sends a message m to agent y operating under policy
Q, assuming that x and y have joined policies P and Q, respectively (Figure 2).
Message m is sent by means of a routine provided by the Moses toolkit, which
forwards it to Cx, the controller assigned to x. When this message arrives at Cx,
it generates a sent(x,m,y) event at it. Cx then evaluates the ruling of law L_P
for this event, taking into account the control-state CSx that it maintains, and
carries out this ruling.

If this ruling calls for the control-state CSx to be updated, such an update is carried
out directly by Cx. And if the ruling for the sent event calls for the export of
m to y, this is carried out as follows. If Cx does not have the address of Cy,
the controller assigned to y, it will ask S_P for it. When the secretary responds,
Cx will finalize the export and will cache the address. As such, forthcoming
communication between x and y will not require the extra step of contacting
S_P.

When the message m sent by Cx arrives at Cy, it generates an imported(x,m,y)
event. Controller Cy computes and carries out the ruling of law L_Q for this event.
This ruling might, for example, call for the control-state CSy maintained by Cy to
be modified. The ruling might also call for m to be delivered to y, thus completing
the passage of message m.
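The passage of a message described above (a sent event at the sender's controller, an export, an imported event at the receiver's controller, and finally delivery) can be traced with a toy model. The sketch is ours and deliberately simplified: Controller, law_p, law_q and directory are hypothetical names and not part of the Moses toolkit, a single dictionary plays the role of the secretaries' name service, and each law is a plain function from an event to a list of operations.

```python
class Controller:
    """Hypothetical stand-in for an LGI controller (not the Moses API)."""

    def __init__(self, agent, policy, law, directory):
        self.law = law                   # function: event -> list of operations
        self.directory = directory       # plays the role of the secretary's name service
        self.cache = {}                  # controller addresses already looked up
        self.lookups = 0                 # how often the directory was consulted
        self.inbox = []                  # messages delivered to the agent
        directory[(agent, policy)] = self

    def resolve(self, target):
        """Find the controller of the target agent, caching the answer."""
        if target not in self.cache:
            self.lookups += 1
            self.cache[target] = self.directory[target]
        return self.cache[target]

    def handle(self, event):
        """Compute the ruling of the law for one event and carry it out."""
        for op, x, m, y in self.law(event):
            if op == "export":
                self.resolve(y).handle(("imported", x, m, y))
            elif op == "deliver":
                self.inbox.append((x, m))


def law_p(event):
    """Toy law for policy P: every sent message is exported."""
    kind, x, m, y = event
    return [("export", x, m, y)] if kind == "sent" else []


def law_q(event):
    """Toy law for policy Q: every imported message is delivered."""
    kind, x, m, y = event
    return [("deliver", x, m, y)] if kind == "imported" else []


directory = {}
x, y = ("x", "P"), ("y", "Q")
cx = Controller("x", "P", law_p, directory)
cy = Controller("y", "Q", law_q, directory)
cx.handle(("sent", x, "order-1", y))
cx.handle(("sent", x, "order-2", y))
print(cy.inbox, cx.lookups)
```

The second message reuses the cached address, so the name service is consulted only once, and whether a message is delivered is decided by the receiving policy's law alone.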

In general, all regulated events that occur nominally at an agent $x$ actually
occur at its controller $C_x$. The events pertaining to $x$ are handled sequentially in
chronological order of their occurrence. The controller evaluates the ruling of the
law for each event, and carries out this ruling atomically, so that the sequence of
operations that constitutes the ruling for one event does not interleave with those
of any other event occurring at $x$. Note that a controller might be associated with
several agents, in which case events pertaining to different agents are evaluated
concurrently.

It should be pointed out that the confidence one has in the correct enforcement
of the law at every agent depends on the assurance that all messages are
mediated by correctly implemented controllers. The way to gain such an assurance
is a security issue we will address in Section 4.



3.3 A Case Study 

We now show how a slightly refined version of the three policies $\mathcal{P}_1$, $\mathcal{P}_2$, and
$\mathcal{P}_{1,2}$, introduced in Section 1, can be formalized, and thus enforced, under LGI.
We note that $\mathcal{P}_1$ and $\mathcal{P}_2$ do not depend on each other in any way. Each of these
policies provides for export to, and import from, the interaction policy $\mathcal{P}_{1,2}$, but
they have no dependency on the internal structure of $\mathcal{P}_{1,2}$.

After the presentation of these three policies we will illustrate the manner
in which they interoperate by describing the progression of a single purchase
transaction. We conclude this section with a brief discussion.



Policy $\mathcal{P}_1$ Informally, this policy, which governs the ability of agents of enterprise
$E_1$ to issue purchase orders, can be stated as follows:

For an agent in $E_1$ to issue a purchase order (PO) it must have a budget
assigned to it, with a balance exceeding the price in the PO. Once a
PO is issued, the agent's budget is reduced accordingly. If the PO is not
honored, for whatever reason, then the client's budget is restored.

Formally, under LGI, the components of $\mathcal{P}_1$ are as follows: the group $\mathcal{G}$
consists of the employees allowed to make purchases. The set $\mathcal{M}$ consists of the
following set of messages:

- purchaseOrder(specs,price,c), which denotes a purchase order for
  merchandise described by specs and for which the client c is willing to pay
  amount price.

- supplyOrder(specs,ticket), which represents a positive response to the
  PO, where ticket represents the requested merchandise. (We assume here
  that the merchandise is in digital form, e.g. an airplane ticket, or some kind
  of certificate. If this is not the case, then the merchandise delivery cannot
  be formalized under LGI.)

- declineOrder(specs,price,reason), denoting that the order is declined
  and containing a reason for the decline.

The control-state of each member in this policy contains a term budget(val),
where val is the value of the budget. Finally, the law of this policy is presented in
Figure 3. This law consists of three rules. Each rule is followed by an explanatory
comment (in italics). Note that under this law, members of $\mathcal{P}_1$ are allowed to
interoperate only with members of policy $\mathcal{P}_{1,2}$.

Initially: A member has in his control state an attribute budget(val), where val represents
the total dollar amount it can spend for purchases.

R1. sent(X1, purchaseOrder(Specs,Price,X1), X2) :-
        budget(Val)@CS, Val>Price,
        do(dcr(budget(Val),Price)),
        do(export(X1/p1, purchaseOrder(Specs,Price,X1), X2/p12)).

    A purchaseOrder message is exported to the vendor X2 that operates under the
    inter-enterprise policy p12, but only if Price, the amount X1 is willing to pay
    for the merchandise, is less than Val, the value of the sender's budget.

R2. imported(X2/p12, supplyOrder(Specs,Ticket), X1/p1) :-
        do(deliver(X2, supplyOrder(Specs,Ticket), X1)).

    A supplyOrder message, imported from p12, is delivered without further ado.

R3. imported(X2/p12, declineOrder(Specs,Price,Reason), X1/p1) :-
        do(incr(budget(Val),Price)),
        do(deliver(X2, declineOrder(Specs,Price,Reason), X1)).

    A declineOrder message, imported from p12, is delivered after the budget is
    restored by incrementing it with the price of the failed PO.

Fig. 3. The law of policy $\mathcal{P}_1$
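To make the effect of the three rules of Figure 3 on the control state concrete, here is a minimal Python sketch. It is not LGI law syntax; the function name `law_p1`, the tuple encoding of events and messages, and the dictionary used as control state are assumptions made only for illustration.

```python
def law_p1(event, cs):
    """Return the operations ruled for `event`; `cs` is the member's control state.

    event   = (kind, sender, message, receiver)
    message = (name, *arguments), mirroring the message forms of policy P1.
    """
    kind, src, msg, dst = event
    name, args = msg[0], msg[1:]

    if kind == "sent" and name == "purchaseOrder":        # rule R1
        _specs, price, _client = args
        if cs["budget"] > price:                           # Val > Price
            cs["budget"] -= price                          # dcr(budget(Val), Price)
            return [("export", src, msg, dst, "p12")]
        return []                                          # the PO is simply ignored

    if kind == "imported" and name == "supplyOrder":       # rule R2
        return [("deliver", src, msg, dst)]

    if kind == "imported" and name == "declineOrder":      # rule R3
        _specs, price, _reason = args
        cs["budget"] += price                              # incr(budget(Val), Price)
        return [("deliver", src, msg, dst)]

    return []


state = {"budget": 100}
po = ("purchaseOrder", "ticket to Toledo", 30, "x1")
ops = law_p1(("sent", "x1", po, "x2"), state)   # budget drops to 70, PO is exported
```

A later `declineOrder` for the same price, imported from `p12`, would bring the budget back to 100 before the message is delivered.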



Policy $\mathcal{P}_2$ Informally, this policy, which governs the response of agents of $E_2$ to
purchase orders, can be stated simply as follows:

Each phase of a purchase transaction is to be monitored by a designated
agent called auditor.

The components of $\mathcal{P}_2$ are as follows: the group $\mathcal{G}$ of this policy consists of
the set of employees of $E_2$ allowed to serve purchase orders, and of a designated
agent auditor that maintains the audit trail of their activities. For simplicity,
we assume here that the set of messages recognized by this policy is the same
as for policy $\mathcal{P}_1$; this is not necessary, as will be explained later. The law $\mathcal{L}_2$
of this policy is displayed in Figure 4. Note that unlike $\mathcal{L}_1$, which allows for
interoperability only with policy $\mathcal{P}_{1,2}$, this law allows for interoperability with
arbitrary policies. (The significance of this will be discussed later.)

R1. imported(I/IP, purchaseOrder(Specs,Price,X1), X2/p2) :-
        do(+order(Specs,I,IP)),
        do(deliver(X2, purchaseOrder(Specs,Price,X1), auditor)),
        do(deliver(X1, purchaseOrder(Specs,Price), X2)).

    When a purchaseOrder is imported by the vendor X2, the message is delivered
    to the intended destination and also to the designated auditor.

R2. sent(X2, supplyOrder(Specs,Ticket), X1) :-
        order(Specs,I,IP)@CS, do(-order(Specs,I,IP)),
        do(export(X2/p2, supplyOrder(Specs,Ticket), I/IP)),
        do(deliver(X2, supplyOrder(Specs,Ticket), auditor)).

    A message sent by the vendor is delivered to the auditor. The message is
    exported to I, the interactant from which this order originally came, under
    interaction policy IP. (In our case, I is the object blanket operating under
    $\mathcal{P}_{1,2}$, but this does not have to be the case, as explained later.)

Fig. 4. The law of policy $\mathcal{P}_2$

Policy $\mathcal{P}_{1,2}$ We assume that there is a blanket agreement $\mathcal{P}_{1,2}$ between
enterprises $E_1$ and $E_2$, stated, informally, as follows:

A purchase order is processed by the vendor only if the amount offered
by the client does not exceed the remaining balance in the blanket.

The components, under LGI, of this policy are as follows: the group $\mathcal{G}$ consists
of the set of agents from the vendor-enterprise $E_2$ that may serve purchase
orders, and a distinguished agent called blanket that maintains the balance for
the purchases of the client-enterprise $E_1$. The law $\mathcal{L}_{1,2}$ of this policy is displayed
in Figure 5.

The Progression of a Purchase Transaction We explain now how these policies
function together, by means of a step-by-step description of the progression of a
purchase transaction initiated by a PO purchaseOrder(specs,price,x1) sent
by agent $x_1$ of an enterprise $E_1$ (the client) to an agent $x_2$ of $E_2$ (the vendor).

1. The sending by $x_1$ of a PO to $x_2$ is handled by policy $\mathcal{P}_1$ (see Rule R1 of $\mathcal{P}_1$)
   as follows. If the budget of $x_1$ does not exceed the specified price, then this
   PO is simply ignored; otherwise the following operations are carried out: (a)
   the budget of $x_1$ is decremented by the specified price; and (b) the PO is
   exported to $x_2/\mathcal{P}_{1,2}$, i.e., to agent $x_2$ under policy $\mathcal{P}_{1,2}$.

2. The import of a PO into $x_2/\mathcal{P}_{1,2}$ forces the PO to be immediately forwarded
   to an agent called blanket. Agent blanket, which operates under $\mathcal{P}_{1,2}$, has in
   its control-state the term balance(val), where val represents the remaining
   balance under the blanket agreement between the two enterprises (Rule R1
   of $\mathcal{P}_{1,2}$).

3. The arrival of a PO at the blanket agent causes the balance of the blanket to be
   compared with the price of the PO. If the balance is sufficient, it is decremented
   by the price, and the PO is exported to the vendor agent $x_2/\mathcal{P}_2$ (Rule R2
   of $\mathcal{P}_{1,2}$); otherwise, a declineOrder message is exported back to the client
   $x_1/\mathcal{P}_1$ (Rule R3 of $\mathcal{P}_{1,2}$). We will assume for now that the former happens;
   we will see later what happens when a declineOrder message arrives at a
   client.

4. When a PO exported by agent blanket (signifying consistency with the blanket
   agreement) is imported into $x_2/\mathcal{P}_2$, it is immediately delivered to two
   agents: (a) to the vendor agent $x_2$ itself, for its disposition; and (b) to
   the distinguished agent auditor, designated to maintain the audit trail of
   responses of vendor-agents to purchase orders (Rule R1 of $\mathcal{P}_2$).

Initially: Agent blanket has in its control state a term of the form balance(val),
where val denotes the remaining amount of money that the client-enterprise $E_1$ has
available for purchases, at a given moment in time.

R1. imported(X1/p1, purchaseOrder(Specs,Price), X2/p12) :-
        do(forward(X2, purchaseOrder(Specs,Price,X1), blanket)).

    A purchaseOrder message imported by a vendor X2 is forwarded to blanket for
    approval.

R2. arrived(X2, purchaseOrder(Specs,Price,X1), blanket) :-
        balance(Val)@CS, Val>=Price,
        do(dcr(balance(Val),Price)),
        do(+order(Specs,Price,X1,X2)),
        do(export(blanket/p12, purchaseOrder(Specs,Price,X1), X2/p2)).

    If Price, the sum X1 is willing to pay for the merchandise, does not exceed Val, the
    value of the balance, then the purchaseOrder message is exported to X2, the
    vendor which originally received the request, under policy p2.

R3. arrived(X2, purchaseOrder(Specs,Price,X1), blanket) :-
        balance(Val)@CS, Val<Price,
        do(export(X2/p12, declineOrder(Specs,Price,"insufficient funds"), X1/p1)).

    If the balance is less than the Price then a declineOrder message is exported
    to X1, the client which originally issued the purchaseOrder.

R4. imported(X2/p2, supplyOrder(Specs,Ticket), blanket/p12) :-
        order(Specs,Price,X1,X2)@CS, do(-order(Specs,Price,X1,X2)),
        do(export(X2/p12, supplyOrder(Specs,Ticket), X1/p1)).

    A supplyOrder message is exported to the client X1 which issued the order.

Fig. 5. The law of policy $\mathcal{P}_{1,2}$

5. According to policy $\mathcal{P}_2$, agent $x_2$ that received a PO can respond with a
   supplyOrder message, which triggers two operations: (a) the message is
   exported to blanket/$\mathcal{P}_{1,2}$, and (b) a copy of this message is delivered to the
   auditor object (Rule R2 of $\mathcal{P}_2$).

6. The import of the supplyOrder response of $x_2/\mathcal{P}_2$ into blanket/$\mathcal{P}_{1,2}$ causes it
   to be automatically exported to the client $x_1/\mathcal{P}_1$ (Rule R4 of $\mathcal{P}_{1,2}$).

7. Finally, the import of a supplyOrder message into $x_1/\mathcal{P}_1$ causes this message
   to be delivered to $x_1$ (Rule R2 of $\mathcal{P}_1$), while the import of a declineOrder
   message into $x_1/\mathcal{P}_1$ causes the budget of $x_1$ to be restored before the
   message is delivered to it (Rule R3 of $\mathcal{P}_1$).

^ To keep the example simple we did not describe here the case when the vendor
declines the PO.
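The part of the transaction handled by the blanket agent (steps 2, 3 and 6, i.e. Rules R2 to R4 of Figure 5) can be sketched in Python as follows. This is only an illustration under assumed data structures: the control state is a dictionary holding the balance and a set of pending order terms, and the function names are hypothetical.

```python
def blanket_arrived(cs, specs, price, client, vendor):
    """Rules R2/R3: approve or decline a PO that has arrived at blanket."""
    if cs["balance"] >= price:                                   # R2: Val >= Price
        cs["balance"] -= price
        cs["orders"].add((specs, price, client, vendor))         # +order(...)
        return ("export", ("purchaseOrder", specs, price, client), f"{vendor}/p2")
    return ("export",                                            # R3: Val < Price
            ("declineOrder", specs, price, "insufficient funds"),
            f"{client}/p1")


def blanket_imported_supply(cs, specs, ticket, vendor):
    """Rule R4: pass a supplyOrder on to the client that issued the order."""
    for order in sorted(cs["orders"]):
        o_specs, _price, client, o_vendor = order
        if (o_specs, o_vendor) == (specs, vendor):
            cs["orders"].remove(order)                           # -order(...)
            return ("export", ("supplyOrder", specs, ticket), f"{client}/p1")
    return None  # no matching order term in the control state: nothing is ruled


state = {"balance": 50, "orders": set()}
approved = blanket_arrived(state, "ticket", 30, "x1", "x2")   # exported to x2/p2
declined = blanket_arrived(state, "hotel", 30, "x1", "x2")    # balance is now 20
supplied = blanket_imported_supply(state, "ticket", "T-001", "x2")
```

The second purchase order is declined because the first one has already reduced the blanket balance below its price.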

Discussion This case study makes the following simplifying assumptions: (1) all
three policies use the same set of messages, and (2) the client-enterprise policy
$\mathcal{P}_1$ allows for interoperation only with $\mathcal{P}_{1,2}$. These assumptions are not intrinsic
to the proposed model and were adopted only in order to make the example
as simple as possible. We will explain now the drawbacks of these assumptions
and show how they can be relaxed, making this case study far more general and
flexible.

First, it is unreasonable to assume that completely different enterprises will
use the same vocabulary with the same semantics. While it is required by the
model that interoperating policies (in our example $\mathcal{P}_1$ and $\mathcal{P}_{1,2}$ on one side, and
$\mathcal{P}_{1,2}$ and $\mathcal{P}_2$ on the other) "understand" each other's messages, policies $\mathcal{P}_1$ and
$\mathcal{P}_2$ could have used entirely different messages. The translation from $\mathcal{M}_{\mathcal{P}_1}$ to
$\mathcal{M}_{\mathcal{P}_2}$ can generally be done by the intermediate policy $\mathcal{P}_{1,2}$.

Second, it is unrealistic to assume that an enterprise will purchase merchandise
only from a single vendor, as is required by our current $\mathcal{P}_1$, which is
coded to interoperate only with $\mathcal{P}_{1,2}$, representing the contract between $E_1$ and
$E_2$. In general, one should provide for an agent in $E_1$ to purchase from other
vendors, say from $E_2'$, through an inter-enterprise policy $\mathcal{P}_{1,2'}$ reflecting a
pre-agreement between $E_1$ and $E_2'$. An analogous flexibility is inherent in $\mathcal{P}_2$, which
does not pose any restrictions on the policy it interoperates with, and thus allows
for establishing contracts with different clients. A similar technique can be used
in $\mathcal{P}_1$ to allow purchasing from any number of vendors.

3.4 Assurances 

We are now in a position to explain how the three properties of our concept of
interoperability, namely consensus, autonomy and transparency, are satisfied
by the LGI mechanism.

The consensus condition stipulates that interoperation between a pair of
policies should be agreed to by both. This property is satisfied by our implementation
because for an agent under $\mathcal{P}$ to send a message to an agent under a
different policy $\mathcal{Q}$, $\mathcal{P}$ must have a rule that invokes an export operation to $\mathcal{Q}$,
and $\mathcal{Q}$ must have a rule that responds to an imported event from $\mathcal{P}$.

The autonomy condition is satisfied because the effect on $\mathcal{Q}$ of a message
imported from elsewhere is determined only by the imported-rules in $\mathcal{Q}$. Finally,
transparency is satisfied because, when an agent $y/\mathcal{Q}$ handles a message exported
from $x/\mathcal{P}$, it has access only to the message itself and to its source, but not to
the policy $\mathcal{P}$ under which it has been produced.
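The consensus condition can be pictured as a two-sided check: an export rule on the sending side and an imported rule on the receiving side. The Python sketch below is an assumption-laden illustration, not part of LGI: each policy is summarized by the set of policy names its export and imported rules mention, and `"*"` stands for a rule written with variables that match any policy (as in the law of Figure 4).

```python
def can_interoperate(exports, imports, p, q):
    """True if p has an export rule towards q AND q has an imported rule for p.

    exports[p] and imports[q] are sets of policy names; "*" means "any policy".
    """
    p_exports = exports.get(p, set())
    q_imports = imports.get(q, set())
    exports_ok = q in p_exports or "*" in p_exports
    imports_ok = p in q_imports or "*" in q_imports
    return exports_ok and imports_ok


# Summary of the case study: P1 talks only to P12, P2 accepts any interactant.
exports = {"p1": {"p12"}, "p12": {"p1", "p2"}, "p2": {"*"}}
imports = {"p1": {"p12"}, "p12": {"p1", "p2"}, "p2": {"*"}}

direct = can_interoperate(exports, imports, "p1", "p2")     # False: no export rule in p1
via_p12 = can_interoperate(exports, imports, "p1", "p12")   # True: both sides agree
```

In this summary, a message from a member of `p1` can reach a member of `p2` only through `p12`, which matches the structure of the case study.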

4 A Secure Implementation of Interoperability

To prevent malicious violations of the law, the following conditions have to be
met: (1) messages are sent and received only via correctly implemented
controllers, and (2) messages are securely transmitted over the network. The first
of these conditions can be handled at different levels of security. First,
controllers may be placed on trusted machines. Second, controllers may be trusted
when built into physically secure coprocessors [12].

To ensure condition (2) above we devised and implemented in the Moses toolkit
the controller-controller authentication protocol displayed in Figure 6. The purpose
of the protocol is twofold. First, it has to ensure that messages are securely
transmitted over the network, which is a problem traditionally solved by
authentication protocols. The second, more challenging goal is to authenticate
communicating controllers as genuine controllers operating under interoperating
policies.

This protocol assumes that any controller $C$ has a pair of keys $(K_C, K_C^{-1})$,
where $K_C$ is the public key and is assumed to be known by the trusted authority,
and $K_C^{-1}$ is the private key, and is therefore known only by $C$ itself. The
protocol also assumes that if $C$ is assigned a member in a policy $\mathcal{P}$, then $C$ maintains
a list of the policies with which $\mathcal{P}$ interoperates. For every policy $\mathcal{P}'$ in this
list, $C$ records its identifier $id(\mathcal{P}')$, the hash of the law $H(\mathcal{L}_{\mathcal{P}'})$, and the address
of the secretary of $\mathcal{P}'$. In the current implementation, this information is given
to $C$ by the secretary of $\mathcal{P}$ at the time a member in $\mathcal{P}$ is assigned to $C$.



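(1) $C_x \rightarrow C_y$ : $x,\ m,\ y,\ i,\ id(\mathcal{P}),\ H(\mathcal{L}_\mathcal{P}),\ \{controller, K_{C_x}\}_{K_T^{-1}},\ \{x, m, y, i, H(\mathcal{L}_\mathcal{P}), H(\mathcal{L}_\mathcal{Q})\}_{K_{C_x}^{-1}}$

(2) $C_y \rightarrow C_x$ : $\{controller, K_{C_y}\}_{K_T^{-1}},\ \{i, H(\mathcal{L}_\mathcal{P})\}_{K_{C_y}^{-1}}$

Fig. 6. Controller-controller authentication protocol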



The protocol describes the necessary steps that have to be taken when a
controller $C_x$ sends a message $m$, on behalf of a member $x$ operating in policy $\mathcal{P}$,
to another controller $C_y$ assigned to a member $y$ in policy $\mathcal{Q}$. In the first step of
the protocol, $C_x$ sends to $C_y$ a message consisting of $x$, $m$, $y$, and an index number
$i$. The index $i$ is used to prevent replay attacks and it is maintained by both
$C_x$ and $C_y$. In order to identify to $C_y$ the policy $\mathcal{P}$ to which $x$ belongs, $C_x$ also
transmits $id(\mathcal{P})$, the (unique) identifier of $\mathcal{P}$, and the hash of $\mathcal{L}_\mathcal{P}$. To authenticate
itself to $C_y$ as a genuine controller, $C_x$ sends to $C_y$ its public certificate along
with the signature of a message consisting of $x$, $m$, $y$, $i$, and the hashes of
the sender and destination laws.

Now, when controller $C_y$ receives the message it first checks whether $y$ is
allowed to import messages sent by members in the $\mathcal{P}$ policy-group. If this is the case,
$C_y$ recovers the public key of $C_x$ from the certificate and verifies the signature.
If the signature is correct, $C_y$ is convinced that it is communicating with a
genuine controller, because $C_x$ proved it knows $K_{C_x}^{-1}$, which is authenticated by the
certifying authority. The signature also proves that the message was sent under
$\mathcal{L}_\mathcal{P}$ and with knowledge of $\mathcal{L}_\mathcal{Q}$. If all conditions are met then an imported(x/$\mathcal{P}$,m,y/$\mathcal{Q}$)
event will be triggered at $C_y$.

^ For simplicity we assume here a unique certifying authority; the protocol could be
easily extended to support a hierarchy of such authorities.

In the second step of the protocol, $C_y$ acknowledges receiving the message by
sending to $C_x$ the signature of the index number $i$ and the hash of the law $\mathcal{L}_\mathcal{P}$,
together with its own certificate. After $C_x$ verifies the signature, it is assured
that message $m$ arrived correctly. Moreover, it trusts that it is talking with a
genuine controller because $C_y$ proved to know key $K_{C_y}^{-1}$. By comparing the hash
of the law received with its own, $C_x$ can decide whether $C_y$ operates under the
law it is expected to.
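The checks that the receiving controller performs on the first protocol message can be sketched as follows. This is a simplified illustration under stated assumptions, not the Moses implementation: public-key signatures and certificates are replaced by an HMAC under a key that both sides are assumed to share (Python's standard library has no public-key signatures), certificate verification is omitted, and all names (`ReceivingController`, `accept`, `law_hash`, `sign`) are hypothetical.

```python
import hashlib
import hmac


def law_hash(law_text: bytes) -> bytes:
    """H(L): hash of the text of a law."""
    return hashlib.sha256(law_text).digest()


def sign(key: bytes, *parts: bytes) -> bytes:
    """Stand-in for a signature: HMAC-SHA256 over the joined parts."""
    return hmac.new(key, b"|".join(parts), hashlib.sha256).digest()


class ReceivingController:
    """Checks done by the receiver on message (1) of the protocol."""

    def __init__(self, own_law: bytes, importable: dict):
        self.own_law_hash = law_hash(own_law)
        self.importable = importable   # id(P) -> H(L_P) for policies it may import from
        self.last_index = {}           # sending controller -> last index accepted

    def accept(self, sender, sender_key, x, m, y, i, policy_id, sender_law_hash, signature):
        if self.importable.get(policy_id) != sender_law_hash:
            return False               # unknown policy, or not the expected law
        if i <= self.last_index.get(sender, 0):
            return False               # replayed or stale index
        expected = sign(sender_key, x.encode(), m.encode(), y.encode(),
                        str(i).encode(), sender_law_hash, self.own_law_hash)
        if not hmac.compare_digest(expected, signature):
            return False               # signature does not cover these values
        self.last_index[sender] = i
        return True                    # an imported event may now be triggered


law_p, law_q, key = b"law of P", b"law of Q", b"demo-key"
cy = ReceivingController(law_q, {"P": law_hash(law_p)})
sig = sign(key, b"x", b"m", b"y", b"1", law_hash(law_p), law_hash(law_q))
first = cy.accept("Cx", key, "x", "m", "y", 1, "P", law_hash(law_p), sig)    # accepted
replay = cy.accept("Cx", key, "x", "m", "y", 1, "P", law_hash(law_p), sig)   # rejected
```

Because the signed material includes the index and both law hashes, resending the same message, or sending it under a different law, fails one of the three checks.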



5 Related Work 

The fact that participants in an electronic transaction have different policies, and
the importance of finding a common ground between them, has been recognized
by several researchers. Ketchpel and Garcia-Molina [8] studied the transactions
that occur between a customer who buys items from different vendors through
brokers. The integrity of such transactions is ensured by trusted agents placed
between every two principals. Their role is to generate a transaction protocol
which satisfies the policies of the two principals. The protocol is automatically
generated using a technique called graph sequencing. This is an effective
technique, but it is limited to individual client-vendor situations, in the sense that a
particular client (or vendor) is not bound by an enterprise policy.

Abiteboul, Vianu, Fordham and Yesha [1] propose that the transactions
between a client and a vendor be mediated by relational transducers. Generally,
such a transducer implements the vendor policy, but their mechanism allows
for the modification of the policy. This suggests that, in principle, it should be
possible for a client to add its own policy. However, such a composition of
policies is computationally expensive to enforce; it is undecidable in the general
case.

Composition of policies in the context of access control has been studied by
several authors: Gong and Qian [7] achieve policy interoperation by inferring a
composed policy based on (compatible) sub-policies. Another approach, which
allows for interoperation of not necessarily compatible policies, is policy
combination [3]. Finally, we mention the hierarchical composition of policies
presented in [2]. These approaches rely on the assumption that there is a higher
authority which is aware of all sub-policies. Such solutions are not applicable to
B2B commerce since there is currently no such authority.

^ For simplicity, we do not discuss here the case where $\mathcal{P}$ is not authorized to export messages
to the $\mathcal{Q}$ policy-group, or the signature is incorrect. Suffice it to say that if this is the case
$C_y$ will notify $C_x$, which in turn will notify $x$.

6 Conclusion

This paper addressed the issue of inter-enterprise electronic commerce, which
may be subject to a combination of several heterogeneous policies formulated
independently by different authorities. Starting from a mechanism, such as LGI,
that supports a formal and enforced concept of a policy, we have argued that such
inter-enterprise commerce requires distinct policies to be able to interoperate,
while maintaining mutual transparency, and without losing their autonomy. We
have shown how such a concept of policy-interoperation is implemented in LGI,
in a secure and scalable manner, and we have demonstrated the application of
this facility for inter-enterprise electronic commerce.

Abstract. Metering schemes are cryptographic protocols to count the
number of visits received by web sites. These measurement systems are
used to decide the amount of money to be paid to web sites hosting
advertisements. Indeed, the amount of money paid by the publicity agencies
to the web sites depends on the number of clients which visited the sites.
In this paper we consider a generalization of the metering scheme
proposed by Naor and Pinkas [5]. In their scheme a web site is paid if and
only if it has been visited by at least a certain number, say $h$, of clients.
In our scheme there are two thresholds, say $\ell$ and $h$, with $\ell < h$. If a web
site is visited by at most $\ell$ clients then the web site receives no money.
If it receives at least $h$ visits then it receives a full payment. Finally,
if it receives a number $f$ of visits between $\ell + 1$ and $h - 1$
then it receives a partial payment which depends on $f$. We provide lower
bounds on the size of the information distributed to clients and to servers
by metering schemes and present a scheme which achieves these lower
bounds.



1 Introduction 

Advertisement payments are one of the major sources of revenue for web sites.
The amount of money charged to display ads depends on the number of visits
received by the web site. Web advertisers measure the exposure of their ads by
obtaining usage statistics about web sites which host their ads. Consequently,
advertisers should prevent the web sites from inflating the count of their visits
in order to demand more money. For that reason, there should be an audit
agency which provides valid and accurate usage measurements of the servers
(web sites). To this end, the audit agency should have at its disposal a system to measure
the interaction between servers and clients which is secure against fraud attempts
by the servers and by the clients which visit the web sites. The cryptographic
protocol which provides such a system is called a metering scheme.

Franklin and Malkhi [4] were the first to consider the metering problem in a
rigorous theoretical approach. Their solutions offer only “lightweight security”
and cannot be applied if servers and clients have a strong commercial interest
in falsifying the metering results.

Subsequently, Naor and Pinkas [5] proposed secure metering schemes
as a means to prevent web servers from inflating the count of their visits. They
contemplated a scenario in which there are coalitions of corrupt servers and
clients which cooperate in order to inflate the count of visits received by corrupt
servers. Moreover, the schemes proposed by Naor and Pinkas [5] protect servers
from clients which attempt to disrupt the metering process. In particular, they
have considered metering schemes where a server is able to compute its proof
for a certain time frame if and only if it has been visited in that time frame
by a number of clients larger than or equal to some threshold $h$. The metering
schemes proposed in [5] are efficient and provide a short proof for the metered
data. In their schemes a server which has received a number of visits less than
$h$ is in the same situation as a server which has received no visit. Consequently,
the audit agency will pay nothing to a server which has been visited by fewer
than $h$ clients. The metering scheme in [5] is supposed to operate for at
most $\tau$ time frames, and during these time frames it is secure. A metering scheme
is considered secure at a certain time frame $t$ if any server which is visited by
fewer than $h$ clients at that time frame has no information about its proof.

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 194-208, 2000.

In order to have a more flexible payment system which makes it possible to count the
exact number of visits that a server has received in any time frame, we introduce
metering schemes with pricing. In these schemes there are two thresholds $\ell$ and
$h$, where $\ell < h < n$, and any server can be in three different situations in a given
time frame $t$: 1) the server is visited by a number of clients greater than or equal
to $h$; 2) the server is visited by a number of clients smaller than or equal to $\ell$; 3)
the server is visited by a number of clients between $\ell + 1$ and $h - 1$.
The audit agency would pay the whole negotiated amount for the exposure of the
ads in case 1); it would pay nothing in case 2); and it would pay a smaller sum in
case 3). For any server and for any time frame there is a proof associated with any
number of client visits between $\ell + 1$ and $h$. Hence, the audit agency
could pay a certain sum, growing with the number of visits, in case 3).
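The three payment situations can be written as a small function. The text only states that the partial payment grows with the number of visits; the linear rule used below is purely an assumption for illustration, as are the function and parameter names.

```python
def payment(f, low, high, full_amount):
    """Payment for f visits, with thresholds low < high (l and h in the text)."""
    if f <= low:                       # case 2: at most l visits, no payment
        return 0.0
    if f >= high:                      # case 1: at least h visits, full payment
        return float(full_amount)
    # Case 3, ASSUMED pricing rule: linear in the number of visits above `low`.
    return full_amount * (f - low) / (high - low)


# With l = 3, h = 10 and a negotiated amount of 70:
amounts = [payment(f, 3, 10, 70) for f in (3, 5, 10, 12)]
```

Any other increasing function of `f` on the range between the two thresholds would fit the description equally well.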

Metering schemes involve distributing information to clients and servers. In
the model we consider, the clients receive a certain amount of information from
the audit agency and use this information to compute the information passed to
the servers when visiting them. Obviously, such information distribution affects
the overall communication complexity. A major goal is to construct metering
schemes whose overhead on the overall communication is as small as possible.
With this motivation, we decided to focus mainly on the size of the information
received by clients and servers in metering schemes, as well as on the size of
the proof each server computes and sends to the audit agency. In this paper we
provide lower bounds on the size of the information distributed to parties and
we present a scheme achieving these lower bounds.



^ In metering schemes, a proof is a value that the server can compute at the end of
each time frame if and only if it has been visited by a fixed number of clients. Such
a value, at the end of each time frame, is sent to the audit agency.

2 The Model

In this section we define metering schemes with pricing in terms of entropy. We
use the entropy approach mainly because this leads to a compact and simple
description of the schemes and because the entropy approach takes into account
all probability distributions on the sets of the proofs generated by servers. For
the reader's convenience, the notations introduced in this section are summarized
in Appendix B.

In this paper with a boldface capital letter, say $\mathbf{X}$, we denote a random
variable taking values on a set, denoted with the corresponding capital letter
$X$, according to some probability distribution $\{Pr_{\mathbf{X}}(x)\}_{x \in X}$. The values such a
random variable can take are denoted with the corresponding lower-case letter. Given
a random variable $\mathbf{X}$ we denote with $H(\mathbf{X})$ the Shannon entropy of $\{Pr_{\mathbf{X}}(x)\}_{x \in X}$
(for some basic properties of entropy, consult Appendix A). Let $d$ be an arbitrary
positive integer and let $\mathbf{X}_1, \ldots, \mathbf{X}_d$ be $d$ random variables taking values on the
sets $X_1, \ldots, X_d$, respectively. For any subset $V = \{i_1, \ldots, i_v\} \subseteq \{1, \ldots, d\}$, with
$i_1 < \cdots < i_v$, we denote with $X_V$ the set $X_{i_1} \times \cdots \times X_{i_v}$ and with $\mathbf{X}_V$ the
sequence of random variables $\mathbf{X}_{i_1}, \ldots, \mathbf{X}_{i_v}$.
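As a small worked illustration of the quantity $H(\mathbf{X})$ used throughout this section, the following Python sketch computes the Shannon entropy, in bits, of a finite probability distribution. The representation of a distribution as a dictionary from values to probabilities is an assumption made for the example.

```python
from math import log2


def entropy(dist):
    """Shannon entropy H(X), in bits, of a distribution {value: probability}."""
    return sum(-p * log2(p) for p in dist.values() if p > 0)


fair_coin = entropy({"heads": 0.5, "tails": 0.5})          # one bit of uncertainty
constant = entropy({"always": 1.0})                        # no uncertainty at all
four_way = entropy({v: 0.25 for v in ("a", "b", "c", "d")})  # two bits
```

A variable whose value is certain has zero entropy, and a uniform choice among $2^k$ values has entropy $k$ bits.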

A metering system consists of $n$ clients, say $\mathcal{C}_1, \ldots, \mathcal{C}_n$, $m$ servers, say
$\mathcal{S}_1, \ldots, \mathcal{S}_m$, and an audit agency $\mathcal{A}$ whose task is to measure the interaction
between the clients and the servers in order to count the number of client visits
that any server receives. Servers which have been visited by at least $h$ clients
receive a full payment of the negotiated amount of money, whereas those which
have received at most $\ell$ visits receive no money at all. The servers which have
been visited by a number of clients between $\ell + 1$ and $h - 1$ receive
a partial payment of the negotiated amount of money. Such a partial payment
grows with the number of clients which have been served. To this end, a server
which has been visited by a number $f$ of clients between $\ell + 1$ and $h$
should be able to provide the audit agency with a proof of the number of visits
it has received. A server which has been visited by more than $h$ clients would
provide the agency with the same proof it would have provided if it had received
$h$ visits. For any $j = 1, \ldots, m$, $t = 1, \ldots, \tau$, and $\ell < f \le h$, we denote with
$p^t_{j,f}$ the proof computed by the server $\mathcal{S}_j$ when it has been visited by $f$ distinct
clients in time frame $t$. We refer to such a proof as the $f$-proof of $\mathcal{S}_j$ in time
frame $t$. Moreover, we denote with $P^t_{j,f}$ the set of all values that $p^t_{j,f}$ can assume.
For any $r = \ell + 1, \ldots, h$, we define $L_r = \{\ell + 1, \ldots, r\}$ and we denote by $p^t_{j,L_r}$
the proofs $p^t_{j,\ell+1}, \ldots, p^t_{j,r}$. Moreover, we denote with $P^t_{j,L_r}$ the set of all values
that $p^t_{j,L_r}$ can assume. We also define $L_r = \emptyset$, for any $r \le \ell$. To simplify the
notation, we define $P^t_{j,\emptyset} = \emptyset$, for any $j = 1, \ldots, m$ and $t = 1, \ldots, \tau$.

The audit agency provides each client with some information about the
servers' proofs. For any $i = 1, \ldots, n$, we denote with $c_i$ the information that
the audit agency $\mathcal{A}$ gives to the client $\mathcal{C}_i$, and with $C_i$ the set of all possible
values of $c_i$. The information $c_i$ is used by $\mathcal{C}_i$ to compute the information given
to the servers when visiting them. For any $i = 1, \ldots, n$, $j = 1, \ldots, m$, and
$t = 1, \ldots, \tau$, we denote with $c^t_{i,j}$ the information that the client $\mathcal{C}_i$ sends to the
server $\mathcal{S}_j$ when visiting it in time frame $t$. Moreover, we denote with $C^t_{i,j}$ the set
of all possible values of $c^t_{i,j}$. We require that for any time frame $t = 1, \ldots, \tau$, each
client can compute the piece to be given to any visited server. More formally, it
holds that

$H(\mathbf{C}^t_{i,j} \mid \mathbf{C}_i) = 0$, for $i = 1, \ldots, n$, $j = 1, \ldots, m$, and $t = 1, \ldots, \tau$. (1)
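Condition (1) says that the piece a client hands to a server carries no uncertainty once the client's own information is known, i.e. the piece is a function of that information. The Python sketch below computes a conditional entropy from a joint distribution and contrasts a deterministic case with a non-deterministic one; the dictionary representation and the toy values are assumptions made for illustration.

```python
from collections import defaultdict
from math import log2


def conditional_entropy(joint):
    """H(Y | X), in bits, for a joint distribution {(x, y): probability}."""
    px = defaultdict(float)
    for (x, _y), p in joint.items():
        px[x] += p                                  # marginal probability of x
    return sum(-p * log2(p / px[x]) for (x, _y), p in joint.items() if p > 0)


# The piece is determined by the client's information: H(piece | info) = 0.
deterministic = {("info_a", "piece_1"): 0.5, ("info_b", "piece_2"): 0.5}

# Here info_a leaves two equally likely pieces, so some uncertainty remains.
ambiguous = {("info_a", "piece_1"): 0.25, ("info_a", "piece_2"): 0.25,
             ("info_b", "piece_2"): 0.5}

h_det = conditional_entropy(deterministic)
h_amb = conditional_entropy(ambiguous)
```

The first distribution satisfies a requirement of the form (1), while the second violates it because knowing `info_a` still leaves half a bit of uncertainty on average.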

For any $j = 1, \ldots, m$ and $t = 1, \ldots, \tau$, we denote with $X^t_j$ the set of the $d_j$
client visits received by server $\mathcal{S}_j$ in time frame $t$. We require that, for any time
frame $t = 1, \ldots, \tau$ and any $f = \ell + 1, \ldots, h$, any server which has been visited
by $f$ different clients in time frame $t$ can compute its $(\ell + 1)$-proof, $\ldots$, $f$-proof
for time frame $t$. More formally, it holds that

$H(\mathbf{P}^t_{j,L_f} \mid \mathbf{X}^t_j) = 0$ for $j = 1, \ldots, m$, $t = 1, \ldots, \tau$, and $f = \ell + 1, \ldots, h$. (2)

We assume that a certain number, say $c$ with $c < \ell$, of clients and a certain
number, say $s$ with $s < m$, of servers are corrupt. A corrupt server can be
assisted by corrupt clients and other corrupt servers in order to inflate the count
of its visits. A corrupt client $\mathcal{C}_i$ can donate to a corrupt server the whole
information $c_i$ received from the audit agency. At time frame $t$, a corrupt server
can donate to another corrupt server the information that it has received during
time frames $1, \ldots, t$. For any $j = 1, \ldots, m$ and $t = 1, \ldots, \tau$, we denote with $V^{[t]}_j$
all the information known by a corrupt server $\mathcal{S}_j$ in time frames $1, \ldots, t$. This
information includes the sets of client visits received by server $\mathcal{S}_j$ in time frames
$1, \ldots, t$. We also define $V^{[0]}_j = \emptyset$. At time frame $t$ a coalition of $s$ corrupt servers
$\mathcal{S}_{j_1}, \ldots, \mathcal{S}_{j_s}$ which decide to cooperate has at its disposal all information contained in
$V^{[t-1]}_{j_1}, \ldots, V^{[t-1]}_{j_s}$ and the information provided by the clients visiting such
servers during time frame $t$.

A metering system must be secure against any attempt by corrupt servers
to inflate the number of visits they have received. In other words, any $\alpha \le c$
corrupt clients colluding with any $\beta \le s$ corrupt servers should not be able
to infer any information about the value of the proofs to provide to the audit
agency. Formally, let $C_{i_1}, \ldots, C_{i_\alpha}$ be $\alpha \le c$ corrupt clients, let $S_{j_1}, \ldots, S_{j_\beta}$ be a
coalition of $1 \le \beta \le s$ corrupt servers, and let $B = \{j_1, \ldots, j_\beta\}$. Assume that at
some time frame $t \in \{1, \ldots, \tau\}$ each server in the coalition has been visited by
at most $z - \alpha$ clients with $z < h$. Then, for any $f = z + 1, \ldots, h$, the servers in
the coalition have no information on their $f$-proofs. More formally, it holds that



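$H(P_{B,f}^t \mid C_{i_1} \ldots C_{i_\alpha} X_{B,(d)}^t V_B^{[t-1]}) = H(P_{B,f}^t)$, (3)

for $z < f \le h$, $t = 1, \ldots, \tau$, $0 \le \alpha \le c$, and $d_{j_v} + \alpha \le z$, for $v = 1, \ldots, \beta$.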



A metering system satisfying (1), (2), and (3) is termed an $(\ell, h, n, m, \tau, c, s)$
metering system. A cryptographic protocol realizing such a metering system is
called a metering scheme with pricing.

We want to point out that our definition of corrupt servers is slightly different
from that given by Naor and Pinkas in [5]. Indeed, in their model a corrupt server
can donate to another corrupt server only the information collected during the
previous time frames, whereas in our model, which is closer to what can actually
happen, a corrupt server can also donate the information provided by the visits
received in the current time frame.

3 Lower Bounds 

In this section we provide lower bounds on the size of the information distributed
to clients by the audit agency and on the size of the information distributed to
servers by clients.

Since our goal is to prove a lower bound on the size of the information
distributed to clients, we consider the worst possible case in which, at any time frame
$t = 1, \ldots, \tau$ and for any corrupt server $S_j$, the sets . . . , contain the
maximum possible information; in other words, corrupt servers are supposed
to receive visits from all clients during the previous time frames $1, \ldots, t - 1$.
Formally, it holds that

j = 0, for i = 1, . . . , n, j = 1, . . . , m, and 1 < < t < r — 1. (4) 

Consequently, one has

= 0? for j = 1, . . . , m, and 1 < < t < r — 1. (5) 



3.1 Technical Lemmas 

In order to prove our lower bound on the size of the information distributed
to clients, we will resort to the following technical lemmas. The proofs of these
lemmas are omitted, as they can be easily derived.

Lemma 1. Let A and E be two random variables such that $H(A \mid E) = 0$. Then,
for any two random variables F and G, one has $H(G \mid AEF) = H(G \mid EF)$.

Lemma 2. Let D, E, and F be three random variables such that $H(F \mid DE) = 0$
and $H(F \mid E) = H(F)$. It results that $H(D \mid E) = H(F) + H(D \mid EF)$.
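
Since the proofs of the two lemmas are omitted, the following sketch shows one way they can be derived from the chain rule (23) and from properties (18) and (21) of Appendix A. It is offered only as an illustration and is not claimed to be the argument the authors had in mind.

```latex
% Lemma 1: H(A|E) = 0 implies H(G|AEF) = H(G|EF).
\begin{align*}
0 \le H(A \mid EFG) &\le H(A \mid E) = 0
  && \text{by (18) and (21)},\\
H(AG \mid EF) &= H(G \mid EF) + H(A \mid EFG) = H(G \mid EF)
  && \text{by (23)},\\
H(AG \mid EF) &= H(A \mid EF) + H(G \mid AEF) = H(G \mid AEF)
  && \text{by (23), since } H(A \mid EF) \le H(A \mid E) = 0.
\end{align*}
% Lemma 2: H(F|DE) = 0 and H(F|E) = H(F) imply H(D|E) = H(F) + H(D|EF).
\begin{align*}
H(DF \mid E) &= H(D \mid E) + H(F \mid DE) = H(D \mid E)
  && \text{since } H(F \mid DE) = 0,\\
H(DF \mid E) &= H(F \mid E) + H(D \mid EF) = H(F) + H(D \mid EF)
  && \text{since } H(F \mid E) = H(F).
\end{align*}
```

Equating the two expansions in each case gives the stated identities.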

3.2 A Lower Bound on the Size of Clients’ Information 

In this subsection we present a lower bound on the size of the information given
to clients by the audit agency.

The next lemma will be useful to prove a lower bound on the size of the
information distributed to clients. Due to space constraints, we omit its proof,
which can be found in [2].

Lemma 3. Let $U$ be an $(\ell, h, n, m, \tau, c, s)$ metering system. Let $C_{i_1}, \ldots, C_{i_c}$ be
the corrupt clients and let $B$, with $|B| = \beta \le s$, be a set of indices of corrupt
servers. For any $j \in B$, $t = 1, \ldots, \tau$, and $z = \ell - c, \ldots, h - c$, let $X_{j,(z)}^t$ be a set
of visits from $z$ clients other than $C_{i_1}, \ldots, C_{i_c}$ to server $S_j$ in time frame $t$. In
any metering scheme with pricing for $U$ one has

for $r = \ell + 1, \ldots, h$, $v = 1, \ldots, c$, and $t = 1, \ldots, \tau$.

The following theorem provides a lower bound on the information distributed to 
clients in metering schemes with pricing. 

Theorem 1. Let $U$ be an $(\ell, h, n, m, \tau, c, s)$ metering system. Let $B$, with $|B| \le s$,
be a set of indices of corrupt servers. In any metering scheme with pricing for
$U$ it holds that

$H(C_i) \ge \sum_{t=1}^{\tau} H(P_{B,\Lambda_h}^t)$, for any $i = 1, \ldots, n$.



Proof. W.l.o.g. we will assume that $C_1, \ldots, C_c$ are the corrupt clients and prove
the bound for $C_1$.

The following inequality, which will be proved later, holds:

$H(C_1 \mid C_2 \ldots C_c V_B^{[t-1]}) \ge H(P_{B,\Lambda_h}^t) + H(C_1 \mid C_2 \ldots C_c V_B^{[t]})$ (6)

for any $t = 1, \ldots, \tau$.

Starting from $H(C_1 \mid C_2 \ldots C_c V_B^{[0]})$ and iteratively applying inequality (6), we
get

$H(C_1 \mid C_2 \ldots C_c V_B^{[0]}) \ge \sum_{t=1}^{\tau} H(P_{B,\Lambda_h}^t) + H(C_1 \mid C_2 \ldots C_c V_B^{[\tau]})$. (7)

Hence, one has

$H(C_1) \ge H(C_1 \mid C_2 \ldots C_c V_B^{[0]})$ (from (21) of Appendix A)
$\ge \sum_{t=1}^{\tau} H(P_{B,\Lambda_h}^t)$ (from (7) and (18) of Appendix A).



Now let us prove inequality (6).

For the sake of simplicity and w.l.o.g. we will assume $B = \{1, \ldots, \beta\}$. For any
$j \in B$ and $t = 1, \ldots, \tau$, let $X_{j,(\ell-c)}^t$ be a set of visits from $\ell - c$ clients other than
$C_1, \ldots, C_c$ to server $S_j$ in time frame $t$.

starting from if(Ci IC 2 . . . and iteratively applying 

Lemma 3, we get 

h 

r=t+l 



( 8 )
Let us consider the two random variables A = Lh ^ ~ '

Using equations (4) and (5), one can prove that 

= 0 - 

Hence, A and E verify the hypothesis of Lemma 1, and one has 

>if(Ci|C 2 ...CeXy(;,_,)P^_i^v|t (from (21) of Appendix A) 

= iJ(Ci|C 2 ...CcV|t (from Lemma 1). (9) 

It follows that 

if(Ci|C2...c,xy(,_^^p^_^ 

h 

> Y, if(P^..) + if(Ci|C 2 ...CeVW) (from (8)-(9)) 

r=£+l 

>iL(P^_iJ + iL(Ci|C 2 ...CcV|t (from (24) of Appendix A). 

Inequality (6) follows from the above inequality and from (21) of Appendix A 
which implies 

if(Ci|C2 . . > if(Ci|C2 . . 

□ 

In Section 2 we did not make any assumption on the entropies of the random
variables $P_{j,f}^t$ and $P_{j,\Lambda_f}^t$, for $j \in \{1, \ldots, m\}$, $f \in \{\ell + 1, \ldots, h\}$, and $t \in \{1, \ldots, \tau\}$.
Indeed, our results apply to the general case of arbitrary entropies on proofs.
Now suppose that $H(P_{j_1,f_1}^{t_1}) = H(P_{j_2,f_2}^{t_2})$ and $H(P_{j_1,\Lambda_f}^{t_1}) = H(P_{j_2,\Lambda_f}^{t_2})$ for all
$j_1, j_2 \in \{1, \ldots, m\}$, $f_1, f_2, f \in \{\ell + 1, \ldots, h\}$, and $t_1, t_2 \in \{1, \ldots, \tau\}$. We
denote these common entropies by $H(P)$ and $H(P_{\Lambda_f})$, respectively. If the proof
sequences of the $s$ corrupt servers are statistically independent, then Theorem 1
implies

$H(C) \ge s\tau H(P_{\Lambda_h})$, for any client $C$. (10)

Moreover, if for any server $S_j$, the $(\ell + 1)$-proof, ..., $h$-proof associated to $S_j$
are statistically independent, then inequality (10) implies

$H(C) \ge s\tau(h - \ell)H(P)$, for any client $C$. (11)

If the proofs of the servers are also uniformly chosen in a finite field $F$, then
inequality (11) implies

$H(C) \ge (h - \ell)s\tau \log |F|$, for any client $C$. (12)

This bound is tight, as in Section 4 we present a protocol for an $(\ell, h, n, m, \tau, c, s)$
metering system in which the audit agency distributes exactly this information
to clients.

3.3 A Lower Bound on the Size of Servers' Information

In the following we provide a lower bound on the size of the information given
to servers by clients in metering schemes with pricing.

The following theorem provides another lower bound on the communication
complexity of the metering scheme. It implicitly shows that the size of the
information each client has to give out when visiting a server is lower bounded by
the size of the proofs the server could reconstruct.

Theorem 2. Let $U$ be an $(\ell, h, n, m, \tau, c, s)$ metering system. In any metering
scheme with pricing for $U$ it holds that

$H(C_{i,j}^t) \ge H(P_{j,\Lambda_h}^t)$, for any $i = 1, \ldots, n$, $j = 1, \ldots, m$, and $t = 1, \ldots, \tau$.

Proof. The following inequality, which will be proved later, holds.

> if(py) + (i3) 

for any $i = 1, \ldots, n$, $j = 1, \ldots, m$, $t = 1, \ldots, \tau$, $r = \ell + 1, \ldots, h$, and any set of
visits $X_{j,(r-1)}^t$ from $r - 1$ clients other than $C_i$ to server $S_j$ in time frame $t$.

Starting from Li) iteratively applying (13), one gets 



> X if(py) + if(c‘jx 






> H(P] (from (22) and (18) of Appendix A). 



The theorem follows from the above inequality and from (21) of Appendix A
which implies

Now let us prove inequality (13).

Let = PJ. D = CN, E = Xj. = Xj. and F = Pj- 

If £ -h 2 < r < /i, then from (2) one has ^(Pj l _i (r-i)^ = 0. If r = £ -h 1 
then Pj = 0 and conseqnently 7L (Pj |^j,(t)) = 0* Hence, one has that the 
random variables A^ and E^ verify the hypothesis of Lemma 1 and conseqnently 
7L(F|A'E0 = H{F\E'). Then, it resnlts that 



(from Lemma 1) 

> -ff(py (from (21) of Appendix A) 

= H{P]p (from (3)). 

From the above inequality and from (19) of Appendix A, which implies
$H(P_{j,r}^t \mid X_{j,(r-1)}^t P_{j,\Lambda_{r-1}}^t) \le H(P_{j,r}^t)$, it follows that

$H(P_{j,r}^t \mid X_{j,(r-1)}^t P_{j,\Lambda_{r-1}}^t) = H(P_{j,r}^t)$. (14)

Moreover, it results that

$H(P_{j,r}^t \mid C_{i,j}^t X_{j,(r-1)}^t P_{j,\Lambda_{r-1}}^t) \le H(P_{j,r}^t \mid C_{i,j}^t X_{j,(r-1)}^t)$ (from (21) of Appendix A)
$= 0$ (from (2)). (15)



Equations (14)-(15) imply that the random variables D, E, and F verify the
hypothesis of Lemma 2, and consequently one has $H(D \mid E) = H(F) + H(D \mid EF)$.
Hence, one gets



= if(py) + 



p* ) 

J,r7 



(from Lemma 2) 



= H-(py) + H-(cy |xy_i)Pqj 

(from (21) of Appendix A). 



Thus, inequality (13) holds. 



□ 



If for any server $S_j$, the $(\ell + 1)$-proof, ..., $h$-proof associated to $S_j$ are statistically
independent and uniformly chosen in a finite field $F$, then Theorem 2 implies

$H(C_{i,j}^t) \ge (h - \ell) \log |F|$, for $i = 1, \ldots, n$, $j = 1, \ldots, m$, and $t = 1, \ldots, \tau$. (16)

This bound is tight, as in Section 4 we present a protocol for an $(\ell, h, n, m, \tau, c, s)$
metering system in which the clients distribute exactly this information to
servers.



4 The Scheme 



In this section we present a scheme for an $(\ell, h, n, m, \tau, c, s)$ metering system
achieving the bounds (12) and (16) of Section 3. Along the same lines as Naor
and Pinkas [5], we use a modified version of Shamir's secret sharing polynomial
[7]. The proofs are points of a finite field $GF(q)$, where $q$ is a sufficiently large
prime number.

In the following we use the term regular visits to indicate visits performed
by non-corrupt clients. Moreover, we denote with "$\circ$" an operator mapping each
pair $(j, t)$, with $j = 1, \ldots, m$ and $t = 1, \ldots, \tau$, to an element of $GF(q)$, and
having the property that no two distinct pairs $(j, t)$ and $(j', t')$ are mapped to
the same element.

Initialization:

The audit agency A chooses $h - \ell$ random polynomials $P_{\ell+1}(x, y), \ldots, P_h(x, y)$ over
$GF(q)$, where, for $z = \ell + 1, \ldots, h$, the polynomial $P_z(x, y)$ is of degree $z - 1$ in $x$ and
degree $s\tau - 1$ in $y$. Then, A sends to each client $C_i$ the $h - \ell$ univariate polynomials
$P_{\ell+1}(i, y), \ldots, P_h(i, y)$, which are of degree $s\tau - 1$.

Regular Operation for Time Frame t:

When the client $C_i$ visits the server $S_j$ in time frame $t$, it sends to $S_j$ the $h - \ell$ points
$P_{\ell+1}(i, j \circ t), \ldots, P_h(i, j \circ t)$.

Proof Generation and Verification:

Assume that the server $S_j$ has been visited by a number $z$ of clients greater than $\ell$
and less than or equal to $h$ in time frame $t$. Then, the server performs a Lagrange
interpolation of the polynomial $P_z(x, j \circ t)$ and computes the value $P_z(0, j \circ t)$. This
value constitutes the $z$-proof of $S_j$, i.e., the proof that $S_j$ has received $z$ visits. The
server $S_j$ sends the pair $(P_z(0, j \circ t), z)$ to the audit agency. The audit agency can
verify the proof by evaluating the polynomial $P_z(x, y)$ at the point $(0, j \circ t)$.

Figure 1. A metering scheme for an $(\ell, h, n, m, \tau, c, s)$ metering system.
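
The three phases of Figure 1 can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the prime `Q`, the encoding `pair_to_field` chosen for the operator "∘", and all function names are hypothetical choices made for the example, and Python's `random` module is not a cryptographically secure source of randomness.

```python
import random

Q = 2_147_483_647  # assumed prime modulus (2^31 - 1); the paper only requires a large prime q


def pair_to_field(j, t, tau):
    """Hypothetical choice for the operator 'o': an injective map (j, t) -> GF(Q)."""
    return ((j - 1) * tau + t) % Q


def random_bivariate(z, s, tau, rng):
    """Coefficients c[a][b] of P_z(x, y): degree z-1 in x, degree s*tau-1 in y."""
    return [[rng.randrange(Q) for _ in range(s * tau)] for _ in range(z)]


def eval_bivariate(coeffs, x, y):
    """Evaluate sum over a, b of c[a][b] * x^a * y^b in GF(Q)."""
    total = 0
    for a, row in enumerate(coeffs):
        for b, c in enumerate(row):
            total = (total + c * pow(x, a, Q) * pow(y, b, Q)) % Q
    return total


def client_share(coeffs, i):
    """Initialization: the univariate polynomial P_z(i, y) as a coefficient list in y."""
    width = len(coeffs[0])
    return [sum(row[b] * pow(i, a, Q) for a, row in enumerate(coeffs)) % Q
            for b in range(width)]


def eval_univariate(poly, y):
    """Regular operation: the point P_z(i, j o t) sent during a visit."""
    return sum(c * pow(y, b, Q) for b, c in enumerate(poly)) % Q


def lagrange_at_zero(points):
    """Proof generation: interpolate through `points` and return the value at x = 0."""
    result = 0
    for k, (xk, yk) in enumerate(points):
        num, den = 1, 1
        for l, (xl, _) in enumerate(points):
            if l != k:
                num = num * (-xl) % Q
                den = den * (xk - xl) % Q
        result = (result + yk * num * pow(den, -1, Q)) % Q  # modular inverse, Python 3.8+
    return result


def run_demo(ell=2, h=4, s=2, tau=3, seed=0):
    rng = random.Random(seed)
    polys = {z: random_bivariate(z, s, tau, rng) for z in range(ell + 1, h + 1)}
    j, t = 1, 2
    y = pair_to_field(j, t, tau)
    visitors = [1, 2, 3]          # server S_j is visited by z = 3 clients in time frame t
    z = len(visitors)
    points = [(i, eval_univariate(client_share(polys[z], i), y)) for i in visitors]
    proof = lagrange_at_zero(points)
    expected = eval_bivariate(polys[z], 0, y)  # verification by the audit agency
    return proof, expected
```

Calling `run_demo()` returns the server's 3-proof together with the value the audit agency computes from $P_3(x, y)$; the two agree because $z$ points determine a polynomial of degree $z - 1$ in $x$.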

Theorem 3. The scheme described in Figure 1 is a metering scheme for an
$(\ell, h, n, m, \tau, c, s)$ metering system.

Proof. We need to prove that the scheme of Figure 1 satisfies equations (1),
(2), and (3) of Section 2. It is immediate to verify that the scheme satisfies (1).
Indeed, for any $i = 1, \ldots, n$, the information given by the audit agency to the
client $C_i$ consists of the univariate polynomials $P_{\ell+1}(i, y), \ldots, P_h(i, y)$, and for
any $j = 1, \ldots, m$, the information given to the server $S_j$ by client $C_i$ is obtained
by evaluating the univariate polynomials $P_{\ell+1}(i, y), \ldots, P_h(i, y)$ at $j \circ t$.

It is also very easy to verify that the scheme satisfies equation (2). Assume that a
server $S_j$ has been visited by $f$ clients, with $\ell + 1 \le f \le h$, at time frame $t$. Then,
$S_j$ knows $f$ points of each of the polynomials $P_{\ell+1}(x, j \circ t), \ldots, P_f(x, j \circ t)$. Since
these polynomials are all of degree less than or equal to $f - 1$ in $x$, the server
can compute their coefficients by using Lagrange interpolation. In particular, it
can compute its $f$-proof for $t$ by evaluating the polynomial $P_f(x, j \circ t)$ at 0. If
the server $S_j$ has been visited by a number of clients greater than or equal to
$h$ in time frame $t$, then it can reconstruct the $h - \ell$ polynomials, i.e., it can
reconstruct all the proofs for the time frame $t$.

Now we need to prove that our scheme satisfies equation (3). We consider the
worst possible case in which, at any time frame $t = 1, \ldots, \tau$, all corrupt clients decide
to cooperate with all corrupt servers and corrupt servers have collected the
maximum possible information during the previous time frames $1, \ldots, t - 1$. In
other words, for any time frame $t = 1, \ldots, \tau$, we assume that each corrupt client
$C_i$ donates its polynomials $P_{\ell+1}(i, y), \ldots, P_h(i, y)$ to all corrupt servers, and that
any corrupt server $S_j$ knows the polynomials $P_{\ell+1}(x, j \circ t'), \ldots, P_h(x, j \circ t')$, for
$t' = 1, \ldots, t - 1$. In order to prove that our scheme satisfies equation (3), we
need to prove that for any time frame $t = 1, \ldots, \tau$, and for any $z = \ell + 1, \ldots, h$,
a coalition of $\beta \le s$ corrupt servers $S_{j_1}, \ldots, S_{j_\beta}$ is not able to calculate the
proofs $P_z(0, j_1 \circ t), \ldots, P_z(0, j_\beta \circ t)$ if each server in the coalition receives fewer
than $z - c$ regular visits at time frame $t$. In order to calculate $P_z(0, j \circ t)$, the
servers should be able to interpolate either the polynomial $P_z(x, j \circ t)$ or the
bivariate polynomial $P_z(x, y)$. Let us suppose that $S_{j_1}, \ldots, S_{j_\beta}$ is a coalition of
$\beta \le s$ corrupt servers which decide to cooperate in order to inflate the counts of
their client visits at some time frame $t$, with $1 \le t \le \tau$. The information that a
corrupt client $C_i$ donates to a corrupt server is equivalent to the $s\tau$ coefficients of
each of the polynomials $P_{\ell+1}(i, y), \ldots, P_h(i, y)$. For $v = 1, \ldots, \beta$, the information
collected by each corrupt server $S_{j_v}$ during the previous time frames is equivalent
to the coefficients of the polynomials $P_{\ell+1}(x, j_v \circ t'), \ldots, P_h(x, j_v \circ t')$, for
$t' = 1, \ldots, t - 1$. Suppose that at time frame $t$, the server $S_{j_v}$, $v \in \{1, \ldots, \beta\}$,
receives $g_{j_v}$ regular visits. Then, the overall information on $P_z(x, y)$ held by the
servers $S_{j_1}, \ldots, S_{j_\beta}$ consists of

$cs\tau + \beta(t - 1)z + \sum_{v=1}^{\beta} g_{j_v} - c\beta(t - 1)$ (17)

points. The first term of (17) is the information donated by the $c$ corrupt clients,
the second term is the information collected by all servers in the coalition during
the previous time frames, the third term is the information provided by the
client visits at time frame $t$, and the last term is the information which has been
counted twice. For any $z = \ell + 1, \ldots, h$, we will prove that the servers in the
coalition are unable to interpolate the polynomial $P_z(x, y)$ if each server in the
coalition receives fewer than $z - c$ regular visits. Notice that for any $1 \le \beta \le s$,
$t = 1, \ldots, \tau$ and $z = \ell + 1, \ldots, h$, if $g_{j_1}, \ldots, g_{j_\beta}$ are all smaller than $z - c$,
then expression (17) is strictly less than $zs\tau$. Consequently, for any choice of
$a \in GF(q)$ and for any $j = 1, \ldots, m$, there is a polynomial $R(x, y)$ which is
consistent with the information held by the servers in the coalition and such
that $R(0, j \circ t) = a$. Hence, the corrupt servers $S_{j_1}, \ldots, S_{j_\beta}$ have probability at
most $1/q$ of guessing the $z$-proof $P_z(0, j_v \circ t)$, for any $v = 1, \ldots, \beta$ and any time
frame $t = 1, \ldots, \tau$.

Alternatively, any corrupt server $S_{j_v}$, $v = 1, \ldots, \beta$, might try to calculate its
$z$-proof $P_z(0, j_v \circ t)$ by interpolating the polynomial $P_z(x, j_v \circ t)$. Notice that
for any $v, w \in \{1, \ldots, \beta\}$, with $w \ne v$, the information held by the server $S_{j_w}$
is of no help in calculating the polynomial $P_z(x, j_v \circ t)$. We will prove that any
corrupt server $S_{j_v}$, $v = 1, \ldots, \beta$, is not able to calculate its $z$-proof $P_z(0, j_v \circ t)$ if
it has received fewer than $z - c$ regular visits. Then, let us assume $g_{j_v} < z - c$, for
$v = 1, \ldots, \beta$. Each corrupt client $C_i$ donates to $S_{j_v}$ the polynomial $P_z(i, y)$, from
which $S_{j_v}$ can calculate the value $P_z(i, j_v \circ t)$. Notice that for any non-corrupt
client $C_i$, the server $S_{j_v}$ is not able to evaluate the polynomial $P_z(i, y)$. Indeed,
in order to evaluate the polynomial $P_z(i, y)$, $S_{j_v}$ should know $s\tau$ points of this
polynomial. Hence, $S_{j_v}$ can calculate only $c$ values of $P_z(x, j_v \circ t)$ in addition to
those provided by the $g_{j_v}$ visits performed by non-corrupt clients. Consequently,
the overall number of points of $P_z(x, j_v \circ t)$ known to $S_{j_v}$ is less than $z$. Let
$i_1, \ldots, i_c$ be the indices of the corrupt clients and $d_1, \ldots, d_{g_{j_v}}$ be the indices
of the clients which have visited $S_{j_v}$ at time frame $t$. For any choice of a point
$a \in GF(q)$ there is a polynomial $Q(x)$ such that $Q(0) = a$ and $Q(i) = P_z(i, j_v \circ t)$,
for $i \in \{i_1, \ldots, i_c, d_1, \ldots, d_{g_{j_v}}\}$. Hence, the server $S_{j_v}$ has probability at most
$1/q$ of guessing its $z$-proof for time frame $t$. Moreover, as already observed, for
any corrupt server $S_{j_v}$ with $v \in \{1, \ldots, \beta\}$, the servers $S_{j_w}$, $w \ne v$, have
no information on $P_z(x, j_v \circ t)$. Consequently, for any $v \in \{1, \ldots, \beta\}$, all corrupt
servers in the coalition have probability at most $1/q$ of guessing the $z$-proof of
$S_{j_v}$ for time frame $t$.

From the above discussion it follows that both in the case when $S_{j_1}, \ldots, S_{j_\beta}$ are
trying to interpolate the bivariate polynomial $P_z(x, y)$, and in the case when
$S_{j_1}, \ldots, S_{j_\beta}$ are trying to interpolate the polynomials $P_z(x, j_1 \circ t), \ldots, P_z(x, j_\beta \circ t)$,
the probability that they guess one of the $z$-proofs $P_z(0, j_1 \circ t), \ldots, P_z(0, j_\beta \circ t)$ is at most $1/q$.
Consequently, the probability that a coalition of $\beta \le s$ corrupt servers guesses
the whole vector $(P_z(0, j_1 \circ t), \ldots, P_z(0, j_\beta \circ t))$ is at most $1/q^\beta$. □

Notice that in the above scheme the size of the information given to any client is
$(h - \ell) s\tau \log q$, whereas the size of the information that each server receives from
a client during a regular visit is $(h - \ell) \log q$. It is easy to see that this protocol
achieves the bounds (12) and (16) of Section 3.
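
As a quick worked instance of these two sizes, the following sketch evaluates them in bits for concrete parameters. The function names and the sample parameter values are assumptions made only for this illustration.

```python
import math


def client_info_bits(ell, h, s, tau, q):
    """Size of a client's share: (h - ell) * s * tau * log2(q) bits, as in bound (12)."""
    return (h - ell) * s * tau * math.log2(q)


def visit_info_bits(ell, h, q):
    """Size of the message sent in one regular visit: (h - ell) * log2(q) bits, as in bound (16)."""
    return (h - ell) * math.log2(q)


# Hypothetical parameters: thresholds ell = 2 and h = 4, s = 2 corrupt servers,
# tau = 3 time frames, and the prime q = 2^31 - 1.
share = client_info_bits(2, 4, 2, 3, 2**31 - 1)
visit = visit_info_bits(2, 4, 2**31 - 1)
```

With these values a client stores roughly 372 bits and sends roughly 62 bits per visit, so the share is $s\tau = 6$ times larger than a single visit message.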

5 Open Problems 

An interesting open problem is to consider metering systems in which each server
is associated with a distinct pair of thresholds $(\ell, h)$. An even more challenging
problem would be to consider a generalization of such metering systems in which
the thresholds associated to servers may change dynamically at each time frame.

Another open problem is to consider different classes of clients. Each class is
assigned a weight, and the amount of money paid to servers depends not only on
the number of clients they served but also on the classes those clients belong to.

A Information Theory Background

In this section we review the basic concepts of Information Theory used in our
definitions and proofs. For a complete treatment of the subject the reader is
advised to consult [3].

Given a probability distribution $\{Pr_X(x)\}_{x \in X}$ on a set $X$, we define the
entropy$^1$ of $X$, denoted by $H(X)$, as

$H(X) = -\sum_{x \in X} Pr_X(x) \log Pr_X(x)$.

The entropy satisfies the property $0 \le H(X) \le \log |X|$, where $H(X) = 0$ if and
only if there exists $x_0 \in X$ such that $Pr_X(x_0) = 1$; whereas $H(X) = \log |X|$ if
and only if $Pr_X(x) = 1/|X|$, for all $x \in X$.

Given two sets $X$ and $Y$ and a joint probability distribution on their cartesian
product, the conditional entropy $H(X \mid Y)$ is defined as

$H(X \mid Y) = -\sum_{y \in Y} \sum_{x \in X} Pr_Y(y) Pr(x \mid y) \log Pr(x \mid y)$.

From the definition of conditional entropy it is easy to see that

$H(X \mid Y) \ge 0$. (18)

The mutual information between $X$ and $Y$ is defined by $I(X; Y) = H(X) - H(X \mid Y)$
and enjoys the following properties: $I(X; Y) = I(Y; X)$ and $I(X; Y) \ge 0$,
from which one gets

$H(X) \ge H(X \mid Y)$. (19)
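
The definitions above can be checked numerically on small distributions. The following sketch is only an illustration: the helper names and the representation of a joint distribution as a dictionary keyed by pairs `(x, y)` are assumptions of the example.

```python
import math
from collections import defaultdict


def entropy(dist):
    """H(X) for a dict {outcome: probability}; logarithms are to the base 2."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal distribution of component `axis` (0 for X, 1 for Y) of a joint dict."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)


def conditional_entropy(joint):
    """H(X|Y) = -sum over (x, y) of Pr(x, y) * log2 Pr(x|y), with Pr(x|y) = Pr(x, y) / Pr(y)."""
    py = marginal(joint, 1)
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            total -= p * math.log2(p / py[y])
    return total


def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y)."""
    return entropy(marginal(joint, 0)) - conditional_entropy(joint)


# X fully determined by Y: H(X|Y) = 0, so I(X;Y) = H(X).
dependent = {(0, 0): 0.5, (1, 1): 0.5}
# X independent of Y: H(X|Y) = H(X), so I(X;Y) = 0.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
```

In both examples properties (18) and (19) hold, with equality in (18) for the first distribution and equality in (19) for the second.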



Given three sets $X$, $Y$, $Z$ and a joint probability distribution on their cartesian
product, the conditional mutual information between $X$ and $Y$ given $Z$ is

$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid ZY)$ (20)

and enjoys the following properties: $I(X; Y \mid Z) = I(Y; X \mid Z)$ and $I(X; Y \mid Z) \ge 0$.
Since the conditional mutual information is always non-negative we get

$H(X \mid Z) \ge H(X \mid ZY)$. (21)



Given any $n > 1$ sets $X_1, \ldots, X_n$ and a joint probability distribution on their
cartesian product, it holds that

$\sum_{i=1}^{n} H(X_i) \ge H(X_1 \ldots X_n)$. (22)

Given $n + 1$ sets $X_1, \ldots, X_n, Y$ and a joint probability distribution on their
cartesian product, the entropy of $X_1 \ldots X_n$ given $Y$ can be expressed as

$H(X_1 \ldots X_n \mid Y) = H(X_1 \mid Y) + \sum_{i=2}^{n} H(X_i \mid X_1 \ldots X_{i-1} Y)$ (23)

and enjoys the following property:

$H(X_1 \ldots X_n \mid Y) \le \sum_{i=1}^{n} H(X_i \mid Y)$. (24)

$^1$ All logarithms in this paper are to the base 2.

Symbol                                  Meaning
$\ell, h$                               thresholds
$n$                                     number of clients
$m$                                     number of servers
$\tau$                                  number of time frames
$c$                                     number of corrupt clients
$s$                                     number of corrupt servers
$C_i$                                   information distributed to client $C_i$
$C_{i,j}^t$                             visit from client $C_i$ to server $S_j$ in time frame $t$
$B = \{j_1, \ldots, j_\beta\}$          indices of the corrupt servers, $\beta \le s$
$C_{i,B}^t$                             visits from client $C_i$ to servers $S_{j_1}, \ldots, S_{j_\beta}$ in time frame $t$
$X_{j,(d_j)}^t$                         visits from $d_j$ clients to server $S_j$ in time frame $t$
$X_{B,(d)}^t$                           visits from $d$ clients to servers $S_{j_1}, \ldots, S_{j_\beta}$ in time frame $t$
$V_j^{[t]}$                             information collected by server $S_j$ in time frames $1, \ldots, t$
$V_B^{[t]}$                             information collected by servers $S_{j_1}, \ldots, S_{j_\beta}$ in time frames $1, \ldots, t$
$P_{j,f}^t$                             $f$-proof for server $S_j$, where $\ell + 1 \le f \le h$
$P_{B,f}^t$                             $f$-proofs for servers $S_{j_1}, \ldots, S_{j_\beta}$
$\Lambda_r = \{\ell + 1, \ldots, r\}$   indices of proofs, where $r \in \{\ell + 1, \ldots, h\}$
$P_{j,\Lambda_r}^t$                     $(\ell + 1)$-proof, ..., $r$-proof for server $S_j$
$P_{B,\Lambda_r}^t$                     $(\ell + 1)$-proofs, ..., $r$-proofs for servers $S_{j_1}, \ldots, S_{j_\beta}$

Abstract. A particularly suitable design strategy for constructing a robust
distributed algorithm is to endow it with a self-stabilization property.
Such a property guarantees that the system will always return to
and stay within a specified set of legal states within bounded time,
regardless of its initial state. A self-stabilizing application therefore has
the potential of recovering from the effects of arbitrary transient failures.
However, actually verifying that an application self-stabilizes is non-trivial
and can be quite tedious with current proof methodologies.
The self-stabilizing property of distributed algorithms exhibits interesting
analogies to stabilizing feedback systems used in various engineering
domains. In this paper we would like to show that techniques from control
theory, namely Ljapunov's "Second Method," can be used to more
easily verify the self-stabilization property of distributed algorithms.



1 Introduction 

A very promising design strategy for constructing a robust distributed application
is to design it as a self-stabilizing algorithm [16]. Informally, an algorithm
has the self-stabilization property if, starting from an illegal state, it is
guaranteed to return to a specified set of legal states after a finite period of time
(convergence). Additionally, the set of legal states must be closed under normal
system execution, meaning that the algorithm does not voluntarily switch
to any illegal state (closure). The definition of legal and illegal states depends
on the particular application. Generally, all legal states are specified (e.g., by
a state predicate) and illegal states are defined to be those states which are
not legal states. Unfortunately, the verification of self-stabilizing algorithms is a
complicated task [13]. Consequently, the research community is highly engaged
in finding more adequate verification techniques.

The self-stabilization property of a distributed algorithm as described above
exhibits interesting analogies to stable feedback systems used in various engineering
domains, like electrical and mechanical engineering. Informally, a feedback
system is stable if, after a certain finite period of time, the system reaches and
remains in a pre-defined state [10]. In contrast to the self-stabilization research
domain, which is a rather new area of research in computer science, control
theory in the engineering domain has a century-old background and offers a broad
theoretical foundation with powerful criteria for reasoning about the stability of
feedback systems.

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 209-222, 2000.

The aim of our research is to narrow the gap between self-stabilization and 
control theory by adopting criteria originally used for deciding on the stability 
of feedback systems for proving self-stabilization of distributed algorithms. In 
[19] we proved the self-stabilization property of a distributed algorithm by mod- 
eling it as a discrete linear feedback system and subsequently reasoning about 
the location of the roots of their transfer function. But it must be emphasized 
that although the transfer function technique was successful for the particular 
distributed algorithm (a verification of this algorithm based on traditional com- 
puter science techniques has not been achieved yet), it cannot be applied for the 
more general case of non-linear systems. 

In this paper, we present a generalization of our verification technique in 
such a way that it can be adopted even for the non-linear case by the use of 
Ljapunov’s “Second Method” [14]. We would like to draw attention to the fact 
that the new technique in many cases eases the construction of a proof. 

The paper is structured as follows. In the next section, we state the verifi- 
cation problem of self-stabilizing algorithms and describe the verification tech- 
nique traditionally used in computer science. In Sect. 3, we present an alternative 
verification approach. We state our underlying system model, describe how dis- 
tributed algorithms given by guarded commands are mapped to it, and give a 
criterion for reasoning about stability which forms the heart of Ljapunov’s the- 
ory. In Sect. 4, we present a sample algorithm whose self-stabilizing property is 
proven using a new verification technique. Finally, Sect. 5 concludes the paper. 

2 Problem Statement and Traditional Verification 
Technique 

A major problem associated with self-stabilizing distributed algorithms is their 
verification, i.e., the proof that they actually work as required. 

Let $C$ be the set of all possible system states of a system $S$, and let $P$
be a predicate that specifies a subset of $C$. The states specified by $P$ are called
legal states, whereas all other states of $C$ are called illegal states. $P$ must be
guaranteed through the concept of self-stabilization.

Definition 1 (Self-Stabilization [16]). A system $S$ is self-stabilizing towards
predicate $P$ on $C$ iff

(S1) if $P$ holds for $c \in C$ then $P$ also holds for all subsequent system states, and

(S2) starting from an arbitrary system state, $S$ reaches after a finite number of
steps a system state where $P$ holds.

A step is an evaluation cycle which leads to a state change. The constraints (S1)
and (S2) are also called closure and convergence, respectively.
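
For a finite state space, both constraints of Definition 1 can be checked by exhaustive search. The sketch below is a minimal illustration under assumptions that are not part of the paper: the system is given by a hypothetical function `step` returning the list of possible successor states, and the three toy systems are invented for the example.

```python
def is_closed(states, step, legal):
    """(S1) closure: every successor of a legal state is again legal."""
    return all(legal(nxt) for c in states if legal(c) for nxt in step(c))


def converges(states, step, legal):
    """(S2) convergence: every execution from every state reaches a legal state."""
    good = {c for c in states if legal(c)}
    changed = True
    while changed:
        changed = False
        for c in states:
            if c in good:
                continue
            successors = list(step(c))
            # An illegal state is fine only if it can move and all moves lead to good states.
            if successors and all(n in good for n in successors):
                good.add(c)
                changed = True
    return good == set(states)


def is_self_stabilizing(states, step, legal):
    return is_closed(states, step, legal) and converges(states, step, legal)


STATES = range(5)
legal = lambda c: c == 0
countdown = lambda c: [max(c - 1, 0)]  # toy system: counts down to 0 and stays there
rotate = lambda c: [(c + 1) % 5]       # toy system: reaches 0 but leaves it again
frozen = lambda c: [c]                 # toy system: never changes its state
```

Here `countdown` satisfies both constraints, `rotate` converges but violates closure, and `frozen` is closed but does not converge.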

In order to verify that an algorithm self-stabilizes, both constraints must be
proven. This is generally done for each constraint separately.

Proving closure: A proof of the closure constraint shows that the stability
predicate P is an invariant of the algorithm. Doing so is straightforward: assume P
holds at the beginning of a cycle. Then it must be shown that any possible step
taken will again result in a system state where P holds.

Proving convergence: A proof of the convergence constraint is much more com- 
plicated. Generally, it requires the use of a variant function [15] defined on the 
system state. The values of such a variant function are bounded from below and 
decrease with every step. From such a proof it follows that there will be a point 
in time where the variant function reaches a minimum. When the minimum is 
reached, it must be assured that the system is in a legal state and that switching 
between legal states does not result in a change of the value of the variant func- 
tion. The difficulty of this verification strategy lies in the fact that finding such a 
variant function for a given system requires experience and inspiration since the 
function must in itself bear the “essence of convergence” of the system. Thus, 
deriving a variant function for arbitrary systems is regarded as an art rather 
than a craft. 
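
On a finite toy system, a candidate variant function can be tested mechanically. The sketch below encodes one possible reading of the conditions described above (bounded from below, strictly decreasing on every step taken from an illegal state, unchanged on steps between legal states); the function name, the `step` interface, and the toy system are assumptions of the example.

```python
def is_variant_function(states, step, legal, variant, lower_bound):
    """Check a candidate variant function on a finite system given by `step`."""
    for c in states:
        if variant(c) < lower_bound:
            return False                  # not bounded from below
        successors = list(step(c))
        if not legal(c) and not successors:
            return False                  # stuck in an illegal state
        for nxt in successors:
            if legal(c) and legal(nxt):
                if variant(nxt) != variant(c):
                    return False          # value must not change between legal states
            elif variant(nxt) >= variant(c):
                return False              # value must strictly decrease otherwise
    return True


STATES = range(5)
legal = lambda c: c == 0
countdown = lambda c: [max(c - 1, 0)]     # toy system: counts down to the legal state 0
```

For this system the identity `lambda c: c` passes the check with lower bound 0, whereas `lambda c: -c` or a constant function does not.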

A famous example for proving self-stabilization is Dijkstra's token ring protocol
for mutual exclusion [4]: although the algorithm was presented in 1974, it took
12 years before it was finally proven correct (see [5]). We do not claim that
researchers worked on a proof without interruption during this long period,
but it strongly indicates that even for a quite simple-looking self-stabilizing
algorithm, verification is by no means simple.

It should be noted that advanced techniques have been proposed in order to
ease program verification. Among them are convergence stairs [8] and compositions
[17]. These techniques are built upon the traditional verification technique
as described above.

3 An Alternative Verification Technique 

The basic idea of our alternative verification approach is to take advantage of
a technique called Ljapunov's "Second Method" [14]. Through this method it is
possible to more easily identify a variant-like function as described above. We
observed that in quite a few cases, the technique leads to a quasi-automatic
identification of this function. As reported by Kalman and Bertram in
[11, page 371], the objective of Ljapunov's "Second Method" is to answer the
questions of stability of difference equations, utilizing the given form of the
equations but without explicit knowledge of the solutions. This contrasts with Ljapunov's
"First Method," where explicit representations of those solutions are required.
The starting point for the "Second Method" is the observation that finding a
solution for a system given as a collection of difference equations is sometimes quite
complicated, if not impossible. But having a solution, stability or instability can
then easily be proven, since it is then straightforward to reason about trajectories
and equilibrium states of the system in the time domain [21]. Through his "Second
Method" Ljapunov circumvents the problem of finding an explicit solution by
the following observation: assume the kinetic energy level of a system can be
described as a function of the system state. Then, a real-world system (e.g.,
a physical system with components subject to friction etc.) which is initially
started at a certain kinetic energy level will lose kinetic energy as time
proceeds. At some point in time, regardless of the initial system state (and thereby
regardless of the amount of kinetic energy present in the system), all of the
system's kinetic energy will have dissipated. Consequently, the system will enter and
remain in a so-called equilibrium state. In other words: the system has stabilized.
Ljapunov builds upon this basic scheme by generalizing the "notion of kinetic
energy" such that even systems without energy loss or without a concept of
kinetic energy can be treated.

To exploit this method, originally used for feedback systems in the
engineering domain, for self-stabilizing distributed algorithms in the computer
science domain, a possible strategy is to model the latter in terms of the former.
This approach is pursued in the following.

3.1 System Model 

Figure 1 shows a system model for which Ljapunov’s method can be adopted. We 
will show that it allows the modeling of a distributed algorithm which is given 
by guarded commands and whose self-stabilization property is to be verified. 
The system model is used to represent time-discrete variable structure dynamic
systems [20]. The state of such a system at discrete and abstract time $k$ is given
by the $n$-dimensional state vector $x(k) \in \mathbb{R}^n$. Its $i$th component is referred to as
$x_i(k)$, $i = 1, \ldots, n$. The initial state is given by $x(0)$. Depending on a switching
function, given as a scalar function $s : \mathbb{R}^n \to \mathbb{N}_0$, exactly one of $p$ regulators
$R_i$, $i = 0, \ldots, p - 1$, is selected at time $k$ and remains active until time $k + 1$.
$w \in \mathbb{R}^n$ is an $n$-dimensional vector representing the desired state of the system.
A regulator $R_i$ is a sub-system with input $x(k)$ and output $r^i(k) = C^i \cdot x(k) + d^i$.
$r^i(k), d^i \in \mathbb{R}^n$ are $n$-dimensional vectors which are time-variant and
time-invariant, respectively. $C^i \in M(n \times n, \mathbb{R})$ is an $n \times n$-dimensional time-invariant
matrix. When selected at time $k$, regulator $R_i$ maps the state vector $x(k)$ and
the desired state vector $w$ to the control vector $u(k)$, i.e., $u(k) = w + r^i(k)$. The
control vector serves as input to the plant. The plant is characterized by a system
of difference equations. The plant’s output at time $k$ is given by the state vector
$x(k)$. Depending on the state vector and the control vector at time $k$, the plant’s
output at time $k + 1$ evaluates to $A \cdot x(k) + B \cdot u(k)$. $A, B \in M(n \times n, \mathbb{R})$ are
time-invariant $n \times n$ matrices. $A$ is called the state matrix and $B$ is called the
control matrix.

Fig. 1. System model
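
The model above can be illustrated by a minimal Python sketch of one step of the feedback loop. The helper names (`mat_vec`, `vec_add`, `plant_step`) and the two-regulator instance are assumptions made for illustration; only the update rule $x(k+1) = A \cdot x(k) + B \cdot (w + C^i \cdot x(k) + d^i)$ is taken from the text.

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vec_add(u, v):
    """Component-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def plant_step(x, w, regulators, switch, A, B):
    """Return x(k+1) for the regulator selected by the switching function."""
    C, d = regulators[switch(x)]          # regulator R_i selected at time k
    r = vec_add(mat_vec(C, x), d)         # r^i(k) = C^i x(k) + d^i
    u = vec_add(w, r)                     # u(k) = w + r^i(k)
    return vec_add(mat_vec(A, x), mat_vec(B, u))


# Hypothetical two-dimensional instance with A = B = identity and w = 0.
I2 = [[1, 0], [0, 1]]
regulators = {
    0: ([[0, 0], [0, 0]], [0, 0]),        # leaves the state unchanged
    1: ([[-1, 1], [0, 0]], [0, 0]),       # copies component 2 into component 1
}


def switch(x):
    """Hypothetical switching function: select R_1 while x_1 < x_2."""
    return 1 if x[0] < x[1] else 0


x1 = plant_step([-3, 0], [0, 0], regulators, switch, I2, I2)
x2 = plant_step(x1, [0, 0], regulators, switch, I2, I2)
```

In this instance the first step copies the second component into the first, and the second step leaves the state untouched because the guard no longer holds.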



3.2 Matching Distributed Algorithms to the System Model 

We assume that a distributed algorithm is given as a collection of $n$ processes
$P_1, \ldots, P_n$ whose program bodies are collections of guarded commands [3]. Figure 2
shows a process $P_i$ of such a generic distributed algorithm. The local state
of a process is defined by a local variable $S_i$. The initial, but possibly arbitrary,
local state is given by $s_i$. The communication variables $L_{i,j}$ represent local states
of other processes available to process $P_i$. Communication is achieved via the
lookup and modification of these communication variables. While the distributed
algorithm executes, all of its processes cyclically evaluate their guards. A guard
is a boolean expression over the local state and communication variables. Guards
which evaluate to “true” as well as the guarded commands they belong to are
called active. A global entity, called the central daemon, selects within each evaluation
cycle a subset of active guarded commands for execution. The nature of this
subset depends on the central daemon’s strategy. In the scope of this paper, we
assume serial execution semantics, i.e., the central daemon selects exactly one
active guarded command if one or more guarded commands are active. This leads
to an atomic execution of the selected guarded command’s action. Furthermore,
we assume that the central daemon is starvation free and strongly fair in such
a way that it selects a guard which is active for an overall infinite number of
evaluation cycles infinitely often. In the system model, we implement the set of
local states of all processes (and thereby all communication variables) through
the state vector $x$ (the superscript “T” indicates a transposed vector or matrix):

    $x = [S_1, \ldots, S_n]^T$    (1)

Its initial value is given by

    $x(0) = [s_1, \ldots, s_n]^T$    (2)

While the system is running, vector $x$ may change its value as time proceeds. At
time $k$, $x$’s value is given by $x(k)$ with $k \in \mathbb{N}_0$.

    process P_i
    var S_i init s_i    {* local state *}
    {* L_{i,1}, ..., L_{i,m_i} are communication variables *}
    begin
           guard_{i,1} -> action_{i,1}
        [] guard_{i,2} -> action_{i,2}
           ...
        [] guard_{i,m_i} -> action_{i,m_i}
    end

Fig. 2. Process $P_i$, $i = 1, \ldots, n$, of a distributed algorithm given by guarded commands

Assuming serial execution semantics, a particular guarded command - regardless
of which process it originally belonged to - is modeled in two parts: the
action of the guarded command defines a regulator $R_l$ whereas the guard forms
part of the switching function $s(x(k))$. A generic switching function is

    $s(x(k)) = \begin{cases}
        1 & \text{if } guard_{1,1} \text{ is active at system state } x(k) \text{ and selected} \\
        2 & \text{if } guard_{1,2} \text{ is active at system state } x(k) \text{ and selected} \\
        \vdots & \\
        \sum_{i=1}^{n} m_i & \text{if } guard_{n,m_n} \text{ is active at system state } x(k) \text{ and selected} \\
        0 & \text{otherwise}
    \end{cases}$    (3)



Assume that a certain $guard_{i,j}$ is associated with value $l$ of the switching
function. The two parts of a guarded command cooperate such that whenever $guard_{i,j}$
evaluates to “true” and is selected by the central daemon then the corresponding
regulator $R_l$ becomes active at that time, leading to an atomic modification
of the system state which is in correspondence with the guarded command’s
$action_{i,j}$. On the contrary, if the guard does not evaluate to “true” then the
corresponding regulator can never become active and selected in this evaluation
cycle. As already stated, we assume serial execution semantics in the scope of
this paper. However, the present model can easily be modified for supporting
other execution semantics, like the maximum parallel execution semantics¹ as
reported in [7].

¹ According to the maximum parallel execution semantics, at most one guarded
command per process is selected at any time $k$. Such a behavior is achieved by using the
present switching strategy but with a different switching function. The switching
function uses additional regulators to implement the parallel execution of several
actions within a single step.

Depending on the particular distributed algorithm, an additional regulator
must be modeled and included in the system: assume a system state which does
not activate any of the guards. In this case, the feedback system would block.
But since a blocked feedback system cannot be handled by Ljapunov’s “Second
Method,” we transform the system artificially into a non-blocking system by
adding a special regulator $R_0$, called zero regulator, which is only triggered if no
other guard becomes active and therefore possibly selected ($\mathbf{0}$ and $0$ indicate a
zero matrix and a zero vector of adequate dimensions):

    $r^0 := \mathbf{0} \cdot x(k) + 0$    (4)

Having a desired state of $w = 0$ (see below), regulator $R_0$ does no modification
to the system state, thus it will then be executed over and over again, thereby
allowing virtual time $k$ to proceed. If a particular distributed algorithm does not
exhibit blocking system states (such as, e.g., Dijkstra’s token ring protocol for
mutual exclusion with k > n states [4]) then the zero regulator can be omitted.
In (3), the zero regulator $R_0$ is selected if $s(x(k))$ evaluates to zero.
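
A sketch of how the switching function (3) and the zero regulator could be realized in Python is given below. The function name `switching_value`, the representation of guards as predicates, and the pluggable `choose` callback standing in for the central daemon are assumptions of this sketch. In particular, a random choice is fair only with probability one and is not claimed to implement the strongly fair daemon assumed above.

```python
import random


def switching_value(x, guards, choose=random.choice):
    """Return l if the guard of regulator R_l is active and selected, else 0.

    guards[l - 1] is a predicate over the state vector x for regulator R_l.
    choose stands in for the central daemon and picks one active index.
    """
    active = [l for l, guard in enumerate(guards, start=1) if guard(x)]
    if not active:
        return 0          # no guard is active: the zero regulator R_0 runs
    return choose(active)


# Hypothetical guards of a two-process system in which each process
# compares its own state with the state of the other process.
guards = [lambda x: x[0] < x[1], lambda x: x[1] < x[0]]
```

Passing `choose=min` yields a deterministic daemon that always prefers the lowest-numbered active command, which can be convenient for reproducible experiments.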

The desired state w represents the system state the system is expected to 
converge to. Consequently, it must bear the characteristics of states specified 
by the predicate P. Note that the presented system model aims towards self- 
stabilizing algorithms with only a single such state. A more general system is 
beyond the scope of this paper but is not complicated: it builds upon a vector 
function defined on the system state vector x. Comparison between system state 
and desired states is then done by means of the vector function and the desired 
state vector which may both be different in dimension than the original state 
vector. But even when restricted to a single desired state, it is often possible to
substantially help in constructing the overall proof of self-stabilization of a
distributed algorithm with several legal states: for Dijkstra’s token ring protocol
for mutual exclusion with k > n states, for example, a proof can be given whose
crucial part shows that a specific legal state is always reached where the bottom 
process exclusively holds a token. Once in this state, it is very simple to prove 
that all other processes will also exclusively own a token at certain times and 
that the set of legal states is never voluntarily left. 

When modeling the distributed algorithm in terms of the system model and
applying Ljapunov’s “Second Method,” it must finally be assured that the resulting
system exhibits a unique equilibrium state $x_e = w = 0$. This can generally
be achieved through the transformation

    $x_{new}(k) := x(k) - w$    (5)

    $w_{new} := 0$    (6)

as well as a corresponding textual modification of the guarded commands (an
example will be given in Sect. 4).

The resulting system can then be investigated with respect to system stability 
through a criterion of Ljapunov. 

3.3 Ljapunov’s Stability Criterion [12, 14] 

The transformed system represents a discrete-time, free, stationary dynamic system
$x(k + 1) = A \cdot x(k) + B \cdot u(k)$ where $A \cdot 0 + B \cdot 0 = 0$. Based on this fact,
the following theorem can be adopted.

Theorem 1 (Uniform Asymptotical Stability in the Large [12]). Suppose
there exists a scalar function $V(x(k))$ such that $V(0) = 0$ and

(L1) $V(x(k)) > 0$ when $x(k) \neq 0$ and
(L2) $\Delta V(x(k)) := V(x(k + 1)) - V(x(k)) < 0$ when $x(k) \neq 0$ and
(L3) $V(x(k))$ is continuous in $x(k)$ and
(L4) $V(x(k)) \to \infty$ when $\|x(k)\| \to \infty$,

then the equilibrium state $x_e = 0$ is uniformly asymptotically stable in the large
and $V : \mathbb{R}^n \to \mathbb{R}$, with $n$ being the dimension of the state space, is a so-called
Ljapunov function of the system. □

By Ljapunov’s theorem, verification of the self-stabilization property of a
distributed algorithm is transformed into the identification of a Ljapunov function
exhibiting the above properties. If such a function can be identified then “uniform
asymptotical stability in the large” [12] is guaranteed. Among other things, this
means that the system converges from any point in the state space to the only
equilibrium state, namely the origin. Closure is also guaranteed: an equilibrium
state is - once reached - never left in the absence of disturbances which modify
the system state. Thus, self-stabilization is proven [1].
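
Conditions (L1) and (L2) can be probed numerically on sample states before a formal proof is attempted. The following Python sketch is such a sanity check, under the assumption of a hypothetical linear system $x(k+1) = 0.5 \cdot x(k)$ and the candidate $V(x) = \sum_i x_i^2$. Passing the check is evidence, not a proof, since only finitely many states are examined.

```python
import itertools


def step(x):
    """Hypothetical contracting system: x(k+1) = 0.5 * x(k)."""
    return [0.5 * v for v in x]


def candidate_v(x):
    """Candidate Ljapunov function: sum of squared components."""
    return sum(v * v for v in x)


def violations(v, nxt, samples):
    """Return the sample states that break (L1) or (L2)."""
    bad = []
    for x in samples:
        if all(c == 0 for c in x):
            continue                        # (L1), (L2) only constrain x != 0
        positive = v(x) > 0                 # (L1)
        decreasing = v(nxt(x)) - v(x) < 0   # (L2)
        if not (positive and decreasing):
            bad.append(x)
    return bad


# Sample grid of integer states in [-3, 3] x [-3, 3].
grid = [list(p) for p in itertools.product(range(-3, 4), repeat=2)]
bad_states = violations(candidate_v, step, grid)
```

An empty `bad_states` list means that no sampled state contradicts the candidate; a system that does not move, such as the identity map, is reported as violating (L2).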

By using Ljapunov’s “Second Method,” proving self-stabilization basically is
transformed into constructing a Ljapunov function. Therefore, at first glance, one
might come to the conclusion that the problem still has not become any easier.
However, the contrary is the case. Various strategies for constructing suitable
Ljapunov functions are known from the literature (e.g. [9]). Those strategies,
sometimes dating back to the beginning of the last century, are waiting to be used.
Additionally, through the more formal approach, certain necessary conditions
for proving self-stabilization are derived “on-the-fly.” In the next section, we
will demonstrate the semi-automatic verification of a sample algorithm through
our technique.



4 An Example: Stabilizing Maximum 

In the following, we present and verify a self-stabilizing distributed algorithm
which we refer to as the Stabilizing Maximum algorithm. The sample algorithm is
quite simplistic but suffices for demonstrating the basic idea and application of 
the proposed technique. For an example of a more complex algorithm verified 
through this technique please refer to [18]. 

We assume a distributed algorithm consisting of $n$ processes $P_i$, $i = 1, \ldots, n$.
Every process $P_i$ can directly (e.g., through reading of the corresponding
communication variable) communicate with a subset of the other $n - 1$ processes
but not necessarily with all of them. The only requirement is that every process
can somehow communicate with all other processes, i.e., for any pair
$P_i$, $P_j$, process $P_i$ can communicate with $P_j$ either directly or transitively.
Transitive communication assumes that there exists a path $P_i, P_{k_1}, \ldots, P_{k_q}, P_j$ with
$k_1, \ldots, k_q \in \{1, \ldots, n\} \setminus \{i, j\}$, and $P_i$ can directly communicate with $P_{k_1}$, $P_{k_q}$
can directly communicate with $P_j$, and $P_{k_l}$ can directly communicate with $P_{k_{l+1}}$
for $l = 1, \ldots, q - 1$. Beyond this, no specific communication topology is assumed.

The sub-algorithm executed by each process $P_i$ is given in Fig. 3. $S_i$ is process
$P_i$’s local state. $L_{i,1}, \ldots, L_{i,m_i}$ denote the local states of processes with which
process $P_i$ can directly communicate. For ease of description, we call those
processes neighbors of $P_i$. For every neighbor, there exists a guarded command in
$P_i$’s body: when active and selected, process $P_i$ adjusts its own local state by
copying the local state of the particular neighbor into its own local state variable.



    process P_i
    var S_i init s_i    {* local state *}
    {* L_{i,1}, ..., L_{i,m_i} are communication variables *}
    begin
           S_i < L_{i,1} -> S_i := L_{i,1}
        [] S_i < L_{i,2} -> S_i := L_{i,2}
           ...
        [] S_i < L_{i,m_i} -> S_i := L_{i,m_i}
    end

Fig. 3. Process $P_i$, $i = 1, \ldots, n$, of the sample algorithm
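
The behavior described by Fig. 3 can be explored with a small simulation. The sketch below is illustrative only: the function names, the dictionary-based topology, and the random choice standing in for the central daemon are assumptions, and each communication variable is modeled simply as a direct read of the neighbor's current state.

```python
import random


def enabled_commands(state, neighbors):
    """All pairs (i, j) whose guard S_i < L_{i,j} currently holds."""
    return [(i, j) for i in sorted(neighbors) for j in neighbors[i]
            if state[i] < state[j]]


def run_stabilizing_maximum(initial, neighbors, rng):
    """Run until no guard is active; return the final state and step count."""
    state = dict(initial)
    steps = 0
    while True:
        active = enabled_commands(state, neighbors)
        if not active:
            return state, steps
        i, j = rng.choice(active)    # serial semantics: one command per step
        state[i] = state[j]          # action S_i := L_{i,j}
        steps += 1


# Hypothetical line topology P_1 - P_2 - P_3 - P_4 with arbitrary start values.
neighbors = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
initial = {1: 3, 2: 7, 3: 1, 4: 5}
final, steps = run_stabilizing_maximum(initial, neighbors, random.Random(0))
```

Because every executed action strictly increases one local state and no state can exceed the initial maximum, the loop terminates for any connected topology.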



Let a most recent initial local state be the local state which a particular
process has adopted due to the most recent failure situation and not due to the
execution of an action. Now, we can formulate the following theorem.

Theorem 2 (Self-Stabilization of the Stabilizing Maximum Algorithm).
The Stabilizing Maximum algorithm self-stabilizes to a special unison state, namely
a system state in which all $n$ processes have an identical local state and
this local state, interpreted as a natural number, is the maximum of all most recent
initial local states. □

Next, we will prepare the corresponding proof. 



4.1 Matching the Algorithm to the System Model 

We define the state vector $x$ according to (1), i.e., $x_i(k)$ is the local state of
process $P_i$ at time $k$ for all $i = 1, \ldots, n$ and $k \in \mathbb{N}_0$. Thus, the state vector
gives the system state. The arbitrary initial state is given according to (2). Let
$s_{max}$ be the maximum $\max\{s_1, \ldots, s_n\}$ of all most recent initial local states. $A$,
$B \in M(n \times n, \mathbb{R})$ are identity matrices.

The heart of the algorithm consists of $p = \sum_{k=1}^{n} m_k$ guarded commands. As
described in Sect. 3.1, every guarded command leads to the definition of a
regulator, and all guarded commands together define the switching function. Assume
that regulator $R_l$ with $l = j + \sum_{k=1}^{i-1} m_k$ implements $action_{i,j}$. According to
Fig. 1, regulator $R_l$ is given by $r^l(k) = C^l \cdot x(k) + d^l$. Taking the corresponding
guarded command into account, matrix $C^l$ and vector $d^l$ must be specified
as

    $C^l = (c^l_{u,v})$ with $c^l_{u,v} = \begin{cases}
        -1 & \text{if } u = i \text{ and } v = i \\
        1 & \text{if } u = i \text{ and } v = j \\
        0 & \text{otherwise}
    \end{cases}$    (7)

and

    $d^l = 0$    (8)

Thus, when this particular regulator is selected at time $k$ then the subsequent
system state $x(k + 1)$ evaluates to

    $A \cdot x(k) + B \cdot (w + r^l) = x(k) + C^l \cdot x(k) + d^l = x(k) + C^l \cdot x(k)$

In other words, regulator $R_l$ overwrites the local state of process $P_i$ with $P_j$’s
local state. The switching function is given below.
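
Equations (7) and (8) can be checked on a concrete instance. In the following Python sketch, `build_C` and `next_state` are hypothetical helper names, processes are indexed from 1 as in the text, and $A = B = I$, $w = 0$, $d^l = 0$ are assumed so that $x(k+1) = x(k) + C^l \cdot x(k)$. The sketch also assumes $i \neq j$.

```python
def build_C(n, i, j):
    """Matrix C^l of (7) for the action S_i := S_j (1-based, i != j)."""
    C = [[0] * n for _ in range(n)]
    C[i - 1][i - 1] = -1     # removes the old value of x_i
    C[i - 1][j - 1] = 1      # adds the value of x_j
    return C


def next_state(x, C):
    """x(k+1) = x(k) + C^l x(k), valid for A = B = I, w = 0, d^l = 0."""
    return [xu + sum(c * xv for c, xv in zip(row, x))
            for xu, row in zip(x, C)]


# Hypothetical transformed state of three processes; P_3 copies from P_2.
C = build_C(3, 3, 2)
successor = next_state([-4, 0, -6], C)
```

Only row $i$ of $C^l$ is nonzero, so all components other than $x_i$ are carried over unchanged.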



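    $s(x(k)) = \begin{cases}
        1 & \text{if } S_1 < L_{1,1} \text{ and selected} \\
        2 & \text{if } S_1 < L_{1,2} \text{ and selected} \\
        \vdots & \\
        \sum_{k=1}^{n} m_k & \text{if } S_n < L_{n,m_n} \text{ and selected} \\
        0 & \text{otherwise}
    \end{cases}$    (9)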



Next, we have to define the desired state $w$. Since we expect the system to
stabilize such that every local state will adopt the value $s_{max}$, $w$ must be set to
$[s_{max}, \ldots, s_{max}]^T$. Through the final transformation as given by (5) and (6), we
obtain a system which behaves as the untransformed one but with the difference
that the desired state lies in the origin of the state space. Note that no modification
of the guarded commands due to the transformation is necessary, because
the boolean expressions and actions preserve their original meaning: for example,
$S_i < L_{i,j}$ remains valid when subtracting a fixed number from all components
of the state vector. The same holds for an action $S_i := L_{i,j}$. From now on, in
order to simplify terminology, $x$, $w$ etc. always refer to the transformed system.
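
That guards and actions keep their meaning under the transformation (5) can be illustrated as follows. The Python sketch uses a hypothetical three-process state and helper names chosen for this example; it shows that comparing and copying commute with subtracting $s_{max}$ from every component.

```python
def shift(state, offset):
    """Transformation (5) for w = [offset, ..., offset]^T."""
    return [s - offset for s in state]


def guard(state, i, j):
    """Guard S_i < S_j of the sample algorithm (0-based indices)."""
    return state[i] < state[j]


def action(state, i, j):
    """Action S_i := S_j, returning a new state list."""
    result = list(state)
    result[i] = result[j]
    return result


# Hypothetical untransformed state and its transformed counterpart.
state = [3, 7, 5]
s_max = max(state)
shifted = shift(state, s_max)
```

Executing an action and then shifting gives the same vector as shifting first and then executing the action, which is why the guarded commands need no textual change here.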



4.2 Proving Self- Stabilization of the Algorithm 

The system obtained in the previous section allows the use of Ljapunov’s stability
criterion. In order to make the criterion work, we require a suitable Ljapunov
function. But how can this function be identified if one lacks any idea of what
this function might look like?

A widespread standard starting point for identifying a Ljapunov function in 
engineering domains is the following (see [9]): 

    $V(x(k)) = x^T(k) \cdot P \cdot x(k)$    (10)

with $P = (p_{u,v}) \in M(n \times n, \mathbb{R})$ being a time-invariant matrix. Based on the
expectation that this starting point is general enough for “capturing the dynamic
behavior of the system,” all that is left to do is to identify the matrix
elements $p_{u,v}$ such that the constraints (L1) - (L4) of Theorem 1 are satisfied.
Because the system is specified in a very formal manner, the identification of
those matrix elements - if existent - can be assisted and sometimes automatically
be solved by mathematical tools, like Matlab. Constraint (L2) seems to be most
restrictive and is therefore evaluated first.

Proving (L2): Let $R_l$ denote an arbitrary regulator. Then,

    $\Delta V(x(k)) = V(x(k + 1)) - V(x(k)) = x^T(k + 1) \cdot P \cdot x(k + 1) - x^T(k) \cdot P \cdot x(k)$
    $= [A \cdot x(k) + B \cdot u(k)]^T \cdot P \cdot [A \cdot x(k) + B \cdot u(k)] - x^T(k) \cdot P \cdot x(k)$
    $= [A \cdot x(k) + B \cdot (w + r^l(k))]^T \cdot P \cdot [A \cdot x(k) + B \cdot (w + r^l(k))] - x^T(k) \cdot P \cdot x(k)$
    $= [x(k) + r^l(k)]^T \cdot P \cdot [x(k) + r^l(k)] - x^T(k) \cdot P \cdot x(k)$
    $= [x(k) + C^l x(k)]^T \cdot P \cdot [x(k) + C^l x(k)] - x^T(k) \cdot P \cdot x(k)$
    $= x^T(k) \cdot [(C^l)^T \cdot P + (C^l)^T \cdot P \cdot C^l + P \cdot C^l] \cdot x(k)$    (11)

$\Delta V(x(k))$ is required to be equal to zero if $l = 0$ and less than zero if $l = 1, \ldots, p$.
The former is clearly the case, since the zero regulator does not modify the system
state. In order to prove the latter, the fact can be exploited that regulator $R_l$,
$l \neq 0$, is only selected if its corresponding guard - say $guard_{i,j}$ - evaluates to
“true.” Thus, we know that in this case $S_i < L_{i,j}$ holds. Expressed in terms of
the model this means that $x_i(k) < x_j(k)$ for a particular $j \neq i$.
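
The last step of (11) is a purely algebraic identity and can be cross-checked numerically. The Python sketch below compares both sides for integer matrices, assuming $A = B = I$ and $w = 0$ as in the example. All helper names and the concrete values of $P$, $C^l$ and $x$ are assumptions of this sketch.

```python
def transpose(m):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    """Matrix product a * b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def mat_add(*ms):
    """Element-wise sum of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*ms)]


def quad_form(x, m):
    """Quadratic form x^T * m * x."""
    n = len(x)
    return sum(x[u] * m[u][v] * x[v] for u in range(n) for v in range(n))


def delta_v_direct(x, P, C):
    """V(x(k+1)) - V(x(k)) with x(k+1) = x(k) + C x(k)."""
    nxt = [xu + sum(c * xv for c, xv in zip(row, x)) for xu, row in zip(x, C)]
    return quad_form(nxt, P) - quad_form(x, P)


def delta_v_closed_form(x, P, C):
    """Right-hand side of (11): x^T [C^T P + C^T P C + P C] x."""
    Ct = transpose(C)
    CtP = mat_mul(Ct, P)
    return quad_form(x, mat_add(CtP, mat_mul(CtP, C), mat_mul(P, C)))


# Hypothetical data: P is arbitrary, C copies component 2 into component 1.
P = [[2, 1, 0], [0, 3, -1], [1, 0, 1]]
C = [[-1, 1, 0], [0, 0, 0], [0, 0, 0]]
x = [-5, -2, 0]
```

Integer entries are used so that both sides can be compared exactly without floating-point tolerance.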

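Based on this, the following inequalities must be guaranteed by a suitable
choice of $P$.

    $x^T(k) \cdot [(C^l)^T \cdot P + (C^l)^T \cdot P \cdot C^l + P \cdot C^l] \cdot x(k) < 0$ if $x_i(k) < x_j(k)$    (12)

with $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$. Through continued direct evaluation, one
obtains

    $x_j(k) \cdot \Big[\sum_{1 \le r \le n} x_r(k) \cdot p_{j,r}\Big] - x_i(k) \cdot \Big[\sum_{1 \le r \le n} x_r(k) \cdot p_{i,r}\Big] < 0$ if $x_i(k) < x_j(k)$    (13)

Let $x_i(k) + a_{i,j}(k) = x_j(k)$ with $a_{i,j}(k) > 0$. Then, (13) rewrites to

    $(x_i(k) + a_{i,j}(k)) \cdot \Big[\sum_{1 \le r \le n} x_r(k) \cdot p_{j,r}\Big] - x_i(k) \cdot \Big[\sum_{1 \le r \le n} x_r(k) \cdot p_{i,r}\Big] < 0$ if $x_i(k) < x_j(k)$    (14)

In order to further simplify the inequalities, one can either use the help of
mathematical tools or try some matrices which are easy to handle: a matrix $P$ of
diagonal type with $p_{u,v} = 0$ for all $u \neq v$ leads to

    $(x_i(k) + a_{i,j}(k)) \cdot x_j(k) \cdot p_{j,j} - x_i(k) \cdot x_i(k) \cdot p_{i,i}$
    $= (x_i(k) + a_{i,j}(k))^2 \cdot p_{j,j} - x_i^2(k) \cdot p_{i,i}$
    $= x_i^2(k) \cdot (p_{j,j} - p_{i,i}) + 2 \cdot p_{j,j} \cdot a_{i,j}(k) \cdot x_i(k) + p_{j,j} \cdot a_{i,j}^2(k) < 0$ if $x_i(k) < x_j(k)$    (15)

Since $a_{i,j}(k) > 0$ and $x_i(k) < 0$, (15) is satisfied if $p_{j,j} > 0$ and $p_{j,j} = p_{i,i}$:
the left-hand side then reduces to $p_{j,j} \cdot a_{i,j}(k) \cdot (x_i(k) + x_j(k))$, which is negative
because $x_i(k) < x_j(k) \le 0$ in the transformed system. As a
result, a possible choice of matrix $P$ is

    $P = (p_{i,j})$ with $p_{i,j} := \begin{cases} a & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$    (16)

where $a \in \mathbb{R}^+$.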

Proving (L1): It is easy to see that $V(x(k)) = x^T(k) \cdot P \cdot x(k)$ is zero if $x(k) = 0$ and
greater than zero if $x(k) \neq 0$. Hence, $V(x(k))$ is positive definite. Actually,
in this special case, $V(x(k))$ turns out to be an instance of a generalized Euclidean
norm [6]. Consequently, (L1) is trivially satisfied.

Proving (L3): $V(x(k))$ solely consists of continuous functions of $x(k)$. Thus,
$V(x(k))$ must be continuous in $x(k)$.

Proving (L4): Finally, it must be guaranteed that if an arbitrary norm of $x(k)$
approaches infinity, so does $V(x(k))$. Since in our case, $V(x(k))$ is itself a norm,
the constraint is automatically satisfied.

Because (L1) - (L4) hold using $V(x(k)) = x^T(k) \cdot P \cdot x(k)$ with $P$ as given
by (16), Theorem 1 applies: the system is guaranteed to stabilize from any state
to the equilibrium state $w = 0$. This corresponds to the system state of the
untransformed system in which all local states eventually evaluate to $s_{max}$. This
proves Theorem 2. □
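
The proof can be accompanied by a numerical illustration. The following Python sketch runs the transformed sample algorithm on a hypothetical three-process line topology and records $V(x(k)) = a \cdot \sum_u x_u^2(k)$ with $a = 1$. The function names, the edge-list encoding, and the deterministic daemon that always picks the first active command are assumptions of this sketch.

```python
def lyapunov(x, a=1):
    """V(x) = x^T P x for the diagonal matrix P = a * I of (16)."""
    return a * sum(v * v for v in x)


def trace_v(initial, edges):
    """Run the transformed algorithm and return (final state, V values)."""
    s_max = max(initial)
    x = [s - s_max for s in initial]      # transformation (5): x_new = x - w
    values = [lyapunov(x)]
    while True:
        active = [(i, j) for i, j in edges if x[i] < x[j]]
        if not active:
            return x, values
        i, j = active[0]                  # deterministic stand-in for the daemon
        x[i] = x[j]                       # action S_i := L_{i,j}
        values.append(lyapunov(x))


# Directed pairs (i, j): process i can read the state of process j (0-based).
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
final_x, values = trace_v([2, 9, 4], edges)
```

For this instance the recorded values decrease strictly and end at zero, which is the behavior required by (L2) and by convergence to the origin.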

5 Conclusion 

In this paper, we have shown that Ljapunov’s century-old theory can in fact be 
used to elegantly prove the self-stabilization property of distributed algorithms. 
The traditional proof strategy as used in computer science for reasoning about 
self-stabilization is based on a variant function. Whether one succeeds in finding 
such a function or not primarily depends on the proof designer's expertise. A 
successful outcome is by no means guaranteed. 

By using Ljapunov’s “Second Method,” proving self-stabilization is basically
reduced to constructing a Ljapunov function. Thus, at first glance, one might
come to the conclusion that the problem still has not become any easier. However,
we believe the contrary: many strategies for constructing a suitable Ljapunov
function can be found in the literature. Some of those strategies date back
to the beginning of the last century. Through our technique, they could be
reactivated for the benefit of program verification. Additionally, we could observe
that through the more formal approach, certain necessary conditions for proving
self-stabilization are derived “on-the-fly,” as is the case in the sample algorithm
given in the paper. Although creative ideas of the proof designer will always ease 
the construction of a proof, our approach presents a formal framework for the 
proof designer which helps to focus creative work on crucial hot spots in a very 
precise setting. 

We are currently refining and extending our system model in order to more 
easily cope with nondeterminism and fairness aspects. For a more general and 
semi-automatic utilization of the presented verification technique, we are working 
on a “multi-level layering” of Ljapunov functions. Through this layering, we hope 
to obtain an identification strategy for “non-scalar” Ljapunov functions. This is 
expected to cover the cases where a conventional scalar Ljapunov function cannot 
be identified for proving stability: for instance when proving the Game of Cards
[2] with more than two players one is confronted with this problem. For such
situations, we think of a second Ljapunov “sub” -function which still manifests 
convergence, leading to a general proof for all n. We hope to more formally 
report on this result, which has interesting analogies to lexicographically ordered 
variant functions, in the future. We hope that our approach helps to turn the 
goal of verifying self-stabilizing distributed algorithms “into a craft rather than 
preserving it an art.”

Abstract. Refining self-stabilizing algorithms which use tighter scheduling
constraints (weaker daemon) into corresponding algorithms for weaker
or no scheduling constraints (stronger daemon), while preserving the
stabilization property, is useful and challenging. Designing transformation
techniques for these refinements has been the subject of serious
investigations in recent years. This paper proposes a transformation
technique to achieve the above task. The heart of the transformer is a
self-stabilizing local mutual exclusion algorithm. The local mutual exclusion
problem is to grant a process the privilege to enter the critical section if
and only if none of the neighbors of the process has the privilege. The
contribution of this paper is twofold. First, we present a bounded-memory
self-stabilizing local mutual exclusion algorithm for arbitrary networks,
assuming any arbitrary daemon. After stabilization, this algorithm
maintains a bound on the service time (the delay between two successive
executions of the critical section by a particular process). This bound is
$n \times (n - 1) / 2$, where $n$ is the network size. Second, we use the local mutual
exclusion algorithm to design two scheduler transformers which convert the
algorithms working under a weaker daemon to ones which work under
the distributed, arbitrary (or unfair) daemon, both transformers preserving
the self-stabilizing property. The first transformer refines algorithms
written under the central daemon, while the second transformer refines
algorithms designed for the $k$-fair ($k > (n - 1)$) daemon.

Keywords: Local mutual exclusion, self-stabilization, transformer,
unfair daemon.



1 Introduction 

One of the most inclusive approaches to fault-tolerance in distributed systems
is self-stabilization [Dij74,Dol00]. Introduced by Dijkstra [Dij74], this technique
guarantees that, regardless of the initial state, the system will eventually
converge to the intended behavior. The correctness of self-stabilizing algorithms is
proven assuming some type of scheduler (or daemon) as the adversary. The two
most common schedulers are the following: the central scheduler, under which only
one process can execute an atomic step at one time, and the distributed scheduler,
under which any nonempty subset of the enabled processes can execute their atomic
steps simultaneously. Although it is easier to prove stabilization for algorithms
working under the central scheduler, those working under the distributed
scheduler support more practical implementations. So, it makes sense to design
self-stabilizing algorithms under the central scheduler, prove their correctness in
this model, and then transform the algorithms into their corresponding
algorithms under the distributed scheduler, preserving self-stabilization and
other desirable properties. One of the main goals of this paper is to design such
a transformer. The dining philosophers problem [Dij71] deals with the mutual
exclusion among neighboring processes in a ring. The local mutual exclusion 
problem is the extension of this problem to any arbitrary network. We propose 
a bounded memory and bounded service time local mutual exclusion algorithm. 
Then we demonstrate the application of this local mutual exclusion algorithm 
to design a transformer to transform algorithms working under a weaker dae- 
mon to ones which work under the distributed arbitrary daemon, preserving the 
self-stabilization property. 

Related Work. There have been several attempts to develop transformers 
that transform a program written and proven under the assumption of a weak 
daemon to a program self-stabilizing under a strong daemon. A special class of 
systems, called alternator, was introduced in [GH97] for the linear topology and 
in [GH99] for any arbitrary topology. The idea of an alternator is the following: 
Every process has an integer state variable which is bounded by 2d − 1, where d
is the length of the longest simple cycle in the network. One of the main features
of the alternator is that no two enabled neighboring processes can have the
state variable equal to 2d − 1 at the same time. The processes do some effective
work only when they are in state 2d − 1. This non-interference property is used
in [GH99] to design a transformer from the central daemon to the distributed
daemon. The transformation idea is to compose the algorithm A (written under
the central daemon) with the alternator such that the actions of the algorithm
A are executed only when the alternator is in state 2d − 1. Another transformer
was proposed in [MN97]. This method uses timestamps to order the actions of
any self-stabilizing algorithm. One notable feature of [MN97] is that it achieves
the silent stabilization [DGS96], but the algorithm uses unbounded variables.

Another approach in designing transformation techniques is to implement 
the local mutual exclusion among the neighboring processes. Distributed, but 
non-stabilizing solutions to the local mutual exclusion problem are presented in 
[CM84] and [AS90]. The alternator [GH99] can also be considered as a solution 
to the dining philosophers problem. But, the method in [GH99] does not solve 
the local mutual exclusion problem for the following reason: Only when one and 
exactly one among the neighboring processes is in state 2d − 1, the process can
enter the critical section. But, in this algorithm, there are many configurations
where none of the neighboring processes is in state 2d − 1, meaning, none of
them is in critical section. The algorithms in [HP89] and [Gou87] propose self- 
stabilizing solutions to the dining philosophers problem (and hence, to the local 
mutual exclusion problem). But, both solutions use a central daemon and a dis- 
tinguished process to implement the token circulation. The process holding the 
token executes its critical section. Another self-stabilizing solution to the dining
philosophers problem has been proposed recently in [Hua00]. The algorithm
in [Hua00] works under the read/write atomicity model [DIM93], but makes a
very strong assumption: the links of the network are (initially) colored in a special
way. A local mutual exclusion algorithm for tree networks is presented in
[JADT99]. Another solution on trees is proposed in [AS99]. This solution uses 
bounded memory and the read/write atomicity model. But, the proposed algo- 
rithm does not satisfy the local mutual exclusion property during a short time 
when the variables are wrapped around to maintain their bounded nature. 

Recently, a transformer using the local mutual exclusion has been reported 
in [AN99]. Their method is focused on the refinement of atomicity, from high to
low, and works for the finest atomicity grain, i.e., read/write atomicity [DIM93].
The solution also uses bounded variables. But, since the algorithm uses a weakly 
fair daemon, although the service time is bounded, the exact bound on the service 
time cannot be computed (since it depends on the type of daemon). 

Our Contributions. We first present two solutions to the local mutual exclusion
problem. The first solution uses unbounded memory. We then extend
the first algorithm to design a bounded memory solution. Both algorithms work
in the read/write atomicity model [DIM93]. So, our algorithms use the same
model as in [AN99], but, unlike their solution, ours is self-stabilizing under any
arbitrary (unfair) distributed daemon. After stabilization, the service time is
bounded by $n \times (n - 1) / 2$.

Then we use the local mutual exclusion algorithm to design two
stabilization-preserving transformers to transform algorithms written using a weaker daemon
into algorithms which work under the assumption of a stronger daemon. The
first transformer refines algorithms written under the central daemon, while the
second transformer takes as input algorithms designed for the $k$-fair ($k > (n - 1)$)
daemon. Both transformers convert the input algorithms to self-stabilizing
algorithms which work under the assumption of an arbitrary (even unfair) distributed
daemon.

Outline of Paper. The rest of the paper is organized as follows: The model 
and specification of the problem solved in this paper is presented in Section 2. In 
Section 3, two local mutual exclusion algorithms are given. The two transformers 
are discussed in Section 4. Section 5 provides some concluding remarks. 



2 Model and Specification 

A distributed system is a set of state machines called processes. Each process 
can communicate with a subset of the processes called neighbors. We will use 
$\mathcal{N}.x$ to denote the set of neighbors of node $x$, and $|\mathcal{N}.x|$ to represent the number
of neighbors of $x$. The communication among neighboring processes is carried
out using the communication registers (called “shared variables” throughout this
paper). We consider distributed systems consisting of $n$ processes where every
process has a unique identifier. The system’s communication graph is drawn
by representing processes as nodes and the neighborhood relationship by edges
between the nodes. 

Any process in a distributed system executes an algorithm which contains
a finite set of guarded actions of the form: $\langle label \rangle :: \langle guard \rangle \rightarrow \langle statement \rangle$,
where each guard is a boolean expression over the local and shared variables.

A configuration of a distributed system is an instance of the state of the 
system processes. A process is enabled in a given configuration if at least one of 
the guards of its algorithm is true. We denote the set of enabled processes for a 
given configuration by E. 

A distributed system can be modeled by a transition system. A transition
system is a three-tuple $S = (C, T, I)$ where $C$ is the collection of all the
configurations, $I$ is a subset of $C$ called the set of initial configurations, and $T$ is
a function $T : C \to C$. A transition, also called a computation step, is a tuple
$(c_1, c_2)$ such that $c_2 = T(c_1)$. A computation of an algorithm $P$ is a maximal
sequence of computation steps $e = ((c_0, c_1) (c_1, c_2) \ldots (c_i, c_{i+1}) \ldots)$ such that
for $i \ge 0$, $c_{i+1} = T(c_i)$ (a single computation step) if $c_{i+1}$ exists, or $c_i$ is a
terminal configuration. Maximality means that the sequence is either infinite, or it is
finite and no process is enabled in the final configuration. All computations con- 
sidered in this paper are assumed to be maximal. A fragment of a computation 
e is a finite sequence of successive computation steps of e. 
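The notions of computation and fragment can be illustrated with a small sketch. It assumes a deterministic transition function and a terminal-configuration test supplied by the caller; the names `computation` and `fragment`, and the `max_steps` cut-off (needed because a maximal computation may be infinite), are hypothetical and not part of the paper.

```python
def computation(T, c0, is_terminal, max_steps=1000):
    """Return the computation steps (c_i, c_{i+1}) starting from c0.

    The list stops at a terminal configuration, or after max_steps steps
    when the computation would otherwise be infinite (illustrative cut-off).
    """
    steps = []
    c = c0
    while len(steps) < max_steps and not is_terminal(c):
        nxt = T(c)            # a single computation step
        steps.append((c, nxt))
        c = nxt
    return steps


def fragment(e, start, length):
    """A fragment: a finite run of successive computation steps of e."""
    return e[start:start + length]
```

For instance, with `T = lambda c: c - 1` and terminal configuration `0`, the computation from `3` consists of three steps, and any contiguous slice of it is a fragment.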

In a computation, a transition $(c_i, c_{i+1})$ occurs due to the execution of a
nonempty subset of the enabled processes in configuration $c_i$. We assume the
read/write atomicity model [DIM93] with the semantics of [AN99]: In a computation
step, a process either reads the state of one of its neighbors, or writes its
local state, but not both. In every computation step, the subset of processes that
execute is chosen by the scheduler or daemon. We refer to the following types of
daemon in this paper. Central daemon: in every computation step, only one of the
enabled processes is chosen by the daemon. $k$-fair daemon: the daemon cannot select
a process more than $k$ times without choosing another process which
has been continuously enabled. Weakly fair daemon: if a process $p$ is continuously
enabled, $p$ will be eventually chosen by the daemon to execute an action.
Distributed daemon: during a computation step, any nonempty subset of the
enabled processes is chosen by the daemon.
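As a rough illustration of the difference between the central and the distributed daemon, the sketch below models a daemon as a function that picks the processes to execute from the set of enabled ones. The function names and the use of a pseudo-random choice are assumptions made for the example; a real daemon is an adversary and need not be random.

```python
import random


def central_daemon(enabled, rng):
    """Central daemon: exactly one enabled process is chosen per step."""
    return {rng.choice(sorted(enabled))}


def distributed_daemon(enabled, rng):
    """Distributed daemon: any nonempty subset of the enabled processes."""
    pool = sorted(enabled)
    chosen = {p for p in pool if rng.random() < 0.5}
    # Guarantee non-emptiness by falling back to one process.
    return chosen or {rng.choice(pool)}
```

Both functions always return a nonempty subset of `enabled`; only the central daemon guarantees that the subset is a singleton.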

We refer to the distributed unfair daemon as the stronger daemon and all 
other daemons (defined above) as the weaker daemons. 



Self-Stabilization. In order to define self-stabilization for a distributed system,
we use two types of predicates: the legitimacy predicate, defined on the system
configurations and denoted by $\mathcal{L}$, and the problem specification, defined on the
system computations and denoted by $\mathcal{SP}$.

Let $\mathcal{P}$ be an algorithm. The set of all computations of the algorithm $\mathcal{P}$ is
denoted by $\mathcal{E}_{\mathcal{P}}$. Let $X$ be a set and $Pred$ be a predicate defined on the set $X$.
The notation $x \vdash Pred$ means that the element $x$ of $X$ satisfies the predicate
$Pred$ defined on the set $X$.
Definition 1 (Self-Stabilization). An algorithm $\mathcal{P}$ is self-stabilizing for a
specification $\mathcal{SP}$ if and only if the following two properties hold:

(1) convergence: all computations reach a configuration that satisfies the legitimacy
predicate. Formally, $\forall e \in \mathcal{E}_{\mathcal{P}} :: e = ((c_0, c_1)(c_1, c_2) \ldots) : \exists n \geq 0,\ c_n \vdash \mathcal{L}$;

(2) correctness: all computations starting in configurations satisfying the legitimacy
predicate satisfy the problem specification $\mathcal{SP}$. Formally, $\forall e \in \mathcal{E}_{\mathcal{P}} ::
e = ((c_0, c_1)(c_1, c_2) \ldots) : c_0 \vdash \mathcal{L} \Rightarrow e \vdash \mathcal{SP}$.

Local Mutual Exclusion. The specification of the local mutual exclusion problem
($\mathcal{SP}_c$) is the conjunction of two predicates, safety and liveness, defined in
terms of the “privilege to enter the critical section”. Safety predicate: in any
configuration, there exists at least one privileged process, and if a process holds
a privilege, then none of its neighbors holds the privilege. Liveness predicate:
every process holds the privilege infinitely often.

Definition 2 (Fairness Index). Let $\mathcal{P}$ be a self-stabilizing mutual exclusion
algorithm. $\mathcal{P}$ is considered to have a fairness index of $k$ if, in any computation
of $\mathcal{P}$ under the assumption of any daemon, between any two consecutive critical
section executions of a process, any other process can execute its critical section
at most $k$ times.

Let $\mathcal{P}$ be a self-stabilizing local mutual exclusion algorithm. The service time
of $\mathcal{P}$ is the maximum number of critical sections executed by other processes
between two successive executions of the critical section by any process, without
any assumption on the daemon.
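Both quantities can be measured on a finite observed trace, which gives a lower bound on the true values (the definitions quantify over all computations). The sketch below assumes the trace is simply the list of process identifiers in the order in which they executed their critical sections; the function names are hypothetical.

```python
def max_overtakes(cs_trace):
    """Largest number of critical sections executed by one single other
    process between two consecutive critical sections of some process."""
    worst = 0
    last_seen = {}
    for idx, p in enumerate(cs_trace):
        if p in last_seen:
            counts = {}
            for q in cs_trace[last_seen[p] + 1:idx]:
                counts[q] = counts.get(q, 0) + 1
            if counts:
                worst = max(worst, max(counts.values()))
        last_seen[p] = idx
    return worst


def observed_service_time(cs_trace):
    """Largest number of critical sections executed by all other processes
    between two successive critical sections of some process."""
    worst = 0
    last_seen = {}
    for idx, p in enumerate(cs_trace):
        if p in last_seen:
            worst = max(worst, idx - last_seen[p] - 1)
        last_seen[p] = idx
    return worst
```

On the trace `a, b, b, c, a, b`, process `b` enters twice between the two entries of `a`, so the observed fairness index is 2, and three critical sections separate the two entries of `a`.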

Virtual Orientation of the Communication Graph. We will use an “adjacency”
relation, denoted by $\triangleright$, over the shared variables of the processes to define a
virtual orientation of the communication graph. The exact definition (or implementation)
of this relation will be specific to the two solutions to the local
mutual exclusion problem presented in Section 3.

Definition 3 (Virtual Orientation). Let $x.p$ and $x.q$ be two shared variables
of two neighboring processes $p$ and $q$. In the communication graph, the edge
between $p$ and $q$ is said to be virtually oriented from $p$ to $q$ if and only if $x.p \triangleright x.q$.
This edge is an incoming edge for $q$ and an outgoing edge for $p$.

Definition 4 (Privileged Process). A process is said to be privileged in a
configuration if, in the communication graph, all the edges adjacent to the process
are oriented towards it (i.e., incoming edges).

3 Local Mutual Exclusion 

We present two solutions to the local mutual exclusion problem in this section. 
We first present the unbounded space solution because it helps in understanding
the ideas behind the second solution. Next, we will give the bounded
solution which is our final solution. In the next section, we will use the bounded
solution to design two scheduler transformers. Both solutions use the edge
reversal mechanism of [BG89] to maintain the acyclic orientation of the communication
graph.

3.1 Unbounded Local Mutual Exclusion Algorithm 

In this subsection, we propose a self-stabilizing local mutual exclusion algorithm.
The key feature of this algorithm is its $(n - 1)$-fairness index under any arbitrary
distributed daemon.



Constants:
  $id.p$ : unique integer identifier of p;
  $\mathcal{N}.p$ : the set of neighbors of process p;
Shared Variable:
  $L.p$ : unbounded integer;
Local Variables:
  $L\_copy[|\mathcal{N}.p|]$ : array of unbounded integers containing the copy of L of neighbors;
  CS : boolean flag used to indicate if a process is in the critical section or not;
Function:
  $(\forall q \in \mathcal{N}.p) : p \prec q \equiv L\_copy[q] > L.p$
Actions:
  $A_1$ : $\forall q \in \mathcal{N}.p,\ p \prec q \longrightarrow$
      CS = 1;
      execute critical section;
      $L.p = \max\{L\_copy[q] \mid q \in \mathcal{N}.p\} + 1$;
  $CP_1$ : $\exists q \in \mathcal{N}.p,\ q \prec p \wedge L\_copy[q] \neq L.q \longrightarrow$
      CS = 0;
      $L\_copy[q] = L.q$;

Algorithm 3.1: Unbounded Local Mutual Exclusion (ULME) for process p



Algorithm ULME. We borrow the definition of the “direction of an edge”
from [GK93] to define the adjacency relation ($\triangleright$) for Algorithm ULME (shown
as Algorithm 3.1).

Definition 5. For any two neighboring processes $p_1$ and $p_2$ executing Algorithm
ULME, $L.p_2 \triangleright L.p_1$ iff $(L.p_1 < L.p_2) \vee ((L.p_1 = L.p_2) \wedge (id.p_1 < id.p_2))$. We
refer to this situation as “$p_1$ is virtually oriented towards $p_2$”.

Remark 1. Due to the uniqueness of the process ids, $\triangleright$ is a total order (anti-reflexive,
anti-symmetric, and transitive) relation.
The virtual orientation is used in Algorithm ULME in the following manner: A
process enters its critical section if and only if it is privileged (Definition 4), i.e., 
all its incident edges are oriented towards it. Once the process finishes executing 
its critical section, it reverses all its incident edges. 

Every process p maintains a value from a totally ordered set and records
locally a copy of (what it thinks is) the value of each of its neighbors. If p thinks
that its own value is the local minimum, then it becomes privileged. After using
the privilege, p sets its value to the maximum of the locally recorded neighbors'
values plus one (Action $A_1$). Otherwise, if p thinks that there is a neighbor q (of
p) that has a value less than p's, p checks if its locally recorded value is correct.
If not, p updates its local record (Action $CP_1$).
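The behaviour described above can be tried out with a small simulation. This is only a sketch under stated assumptions: a central daemon that picks one enabled action pseudo-randomly, ties between equal values broken by process identifier as in Definition 5, and integer process identifiers that double as dictionary keys. The function name `ulme_run` and its data layout are hypothetical.

```python
import random


def ulme_run(neighbors, L, L_copy, steps, seed=0):
    """Simulate ULME-style actions; return the order of critical sections.

    neighbors[p] lists the neighbors of p, L[p] is the shared value of p,
    and L_copy[p][q] is p's local copy of L[q]. L and L_copy are updated
    in place and may start in an arbitrary (inconsistent) state.
    """
    rng = random.Random(seed)
    cs_trace = []

    def precedes(p, q):
        # p believes it is smaller than q (ids break ties, assumption).
        return (L[p], p) < (L_copy[p][q], q)

    for _ in range(steps):
        enabled = []
        for p in neighbors:
            if all(precedes(p, q) for q in neighbors[p]):
                enabled.append(("A1", p, None))
            else:
                for q in neighbors[p]:
                    if not precedes(p, q) and L_copy[p][q] != L[q]:
                        enabled.append(("CP1", p, q))
        if not enabled:
            break
        action, p, q = rng.choice(enabled)
        if action == "A1":
            cs_trace.append(p)  # p executes its critical section
            L[p] = max(L_copy[p][r] for r in neighbors[p]) + 1
        else:
            L_copy[p][q] = L[q]  # refresh a stale copy
    return cs_trace
```

Starting a three-process path from inconsistent values, every process ends up entering its critical section, and the shared values only ever grow, which is the reason the algorithm needs unbounded memory.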

Note 1. The variable CS in Algorithm ULME is not necessary to solve the local
mutual exclusion problem. We added this in the code in Algorithm 3.1 because 
we would need this to design the transformers in Section 4. 



Correctness of Algorithm ULME. We now give the steps of the proof, providing
bounds for the service time and fairness.

Definition 6 (Legitimate Configuration). A legitimate configuration for
Algorithm ULME (i.e., a configuration which satisfies the legitimacy predicate
$\mathcal{L}_{ULME}$) is a configuration such that: (i) at least one process is privileged and
(ii) no two neighbors are privileged.

Lemma 1. Let G be the communication graph representing the system executing 
Algorithm ULME. The graph G is acyclic. 

Lemma 2 (Convergence). Every computation of Algorithm ULME reaches a
configuration satisfying the predicate $\mathcal{L}_{ULME}$.

Lemma 3. In any computation of Algorithm ULME, between every two succes- 
sive instances of a process being privileged, all its neighbors are privileged. 

Lemma 4 (Liveness). Every process is privileged infinitely often in every com- 
putation of Algorithm ULME. 

Lemma 5 (Fairness index). The fairness index of Algorithm ULME is $(n - 1)$.

Theorem 1. Algorithm ULME is a self-stabilizing local mutual exclusion algorithm
with fairness index $(n - 1)$ under any unfair daemon.

Lemma 6 (Service Time). The delay between two successive executions of the 
critical section by a particular process in Algorithm ULME is bounded by . 

3.2 Refinement to bounded memory 

The main drawback of Algorithm ULME is its unbounded memory requirement.
In the following, we propose a bounded solution. The new algorithm is self-stabilizing
and has the fairness index of $(n - 1)$ under any arbitrary daemon.
Algorithm BLME. Algorithm BLME is a refined version of Algorithm ULME
with bounded variables. This transformation deals with two major related problems:
the maintenance of the acyclic orientation of the communication graph
and the choice of a bound for the variable L. Once the acyclic orientation is
provided, Algorithm BLME (as shown in Algorithm 3.2) is quite similar to Algorithm
ULME.

In the following, we redefine the adjacency relation $\triangleright$ to deal with the bounded
integers. We then provide a bound for the variable L.



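Definition 7 (Cyclic Comparison). Let x and y be two integers bounded by
a positive integer $B > 2$. We define the relation $\triangleright$ as follows:

- $\forall x \in [0, \frac{B}{2}]$:
  1. $y \triangleright x$ iff $y \in [x + 1, x + \frac{B}{2}]$.
  2. $x \triangleright y$ iff $y \in [x + \frac{B}{2} + 1, B - 1] \cup [0, x - 1]$.
- $\forall x \in [\frac{B}{2} + 1, B - 1]$:
  1. $y \triangleright x$ iff $y \in [x + 1, B - 1] \cup [0, x - \frac{B}{2}]$.
  2. $x \triangleright y$ iff $y \in [x - \frac{B}{2} + 1, x - 1]$.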
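A minimal sketch of the cyclic comparison follows. It assumes the reading of Definition 7 in which, seen from x, the B/2 values that follow x cyclically are the ones above it, with B even and all values in [0, B - 1]; the function name `succeeds` and the clamping of the interval at B - 1 are assumptions made for the example.

```python
def succeeds(y, x, B):
    """Return True when y is above x in the cyclic order, seen from x.

    Implements item 1 of both cases: the half = B // 2 values that follow
    x (wrapping around B) are considered above x.
    """
    if not (0 <= x < B and 0 <= y < B):
        raise ValueError("values must lie in [0, B - 1]")
    half = B // 2
    if x <= half:
        # No wrap-around is needed below B - 1 (interval clamped).
        return x + 1 <= y <= min(x + half, B - 1)
    # The interval above x wraps around past B - 1 to small values.
    return x + 1 <= y <= B - 1 or 0 <= y <= x - half
```

With B = 16, the value 1 is above 15 (the comparison wraps around), while 15 is not above 1. As long as two values are at cyclic distance less than B/2, exactly one of them is above the other, which is the situation the algorithm maintains between neighbors.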








Fig. 1. Cyclic Comparison. 



Figure 1 shows the (cyclic) relation between $y_1$, $y_2$, and $y_3$. The example
on the left in Figure 1 shows the situation $y_1 \triangleright x$, $x \triangleright y_2$, and $x \triangleright y_3$, and the
right example indicates that $y_1 \triangleright x$, $y_2 \triangleright x$, and $x \triangleright y_3$.

In the following, based on the relation defined in Definition 7, we redefine the
virtual orientation of the communication graph using the variable L bounded by
B. This orientation must provide the acyclic nature to the communication graph.
We choose the value of B as follows: In order to ensure the acyclic orientation,
the values of L of the neighboring processes must be different. In a completely
connected graph, every node has $(n - 1)$ neighbors. Therefore, the lower bound
for the distance between the values of L of two neighboring processes is n. In
order to avoid formation of a cycle among the nodes of the communication graph,
the sum of n gaps between n processes which form the cycle must be less than
B. The minimum value of B which satisfies the above condition is $n^2$ when n is
even and $n^2 + 1$ when n is odd. We give a more formal argument for the above
explanation in Lemma 7.

Note that the cyclic comparison is an order when all the compared values
are in an interval bounded by n.

Definition 8 (Bounded Virtual Orientation). Let $p_1$ and $p_2$ be two neighboring
processes executing Algorithm BLME. $p_1$ is virtually oriented towards $p_2$
if they satisfy the properties specified in Definitions 3 and 7 (defined over the
variable L for $B = n^2$ if n is even and $B = n^2 + 1$ if n is odd).

Note that the variable L of each process executing Algorithm BLME has
values in $[0..B - 1]$ where B is the constant defined in Definition 8.

As in Algorithm ULME, Algorithm BLME also maintains an acyclic communication
graph in all legitimate configurations. To achieve this characteristic
of the underlying graph, we force the processes to satisfy the following property,
called balance. (We will show in Section 3.2 how the balanced processes
guarantee the communication graph to be acyclic.)

Definition 9 (Balanced processes). Two neighboring processes, $p_1$ and $p_2$,
running Algorithm BLME are said to be balanced with respect to each other
(or $p_1$ and $p_2$ are balanced, in short) if and only if $|L.p_1 - L.p_2| < n \wedge
((L.p_1 \neq L.p_2) \vee (L.p_1 = L.p_2 = 0 \wedge id.p_1 < id.p_2))$. When all processes are balanced
with respect to all their neighbors, we refer to this situation as a balanced
configuration. We use the notation $p_1 \simeq p_2$ to indicate that $p_1$ and $p_2$ are two
neighboring balanced processes.

Let $p_1$ and $p_2$ be two neighboring processes which are not balanced. We call these
processes unbalanced and denote this condition as $p_1 \not\simeq p_2$. If at least two processes
are unbalanced, we say that the system is in an unbalanced configuration.
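The bound B and the balance test can be written down directly. The sketch below follows the inequality of Definition 9 as it appears in the text (a strict bound by n, and the identifier test for the all-zero case); the function names `bound_B` and `balanced` are hypothetical.

```python
def bound_B(n):
    """Bound on L: n^2 when n is even, n^2 + 1 when n is odd (always even)."""
    return n * n if n % 2 == 0 else n * n + 1


def balanced(L1, id1, L2, id2, n):
    """Balance test between two neighbors, following Definition 9 as printed."""
    close_enough = abs(L1 - L2) < n
    distinct_or_reset = (L1 != L2) or (L1 == L2 == 0 and id1 < id2)
    return close_enough and distinct_or_reset
```

For n = 4 the bound is 16, and for n = 5 it is 26. Two neighbors holding the values 3 and 5 are balanced when n = 4, whereas two neighbors that both hold 3 are not.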

A process p is enabled to execute its critical section if it is balanced (Definition 9)
with its neighbors and if all the edges adjacent to p are oriented towards
p (Action $A_1$). After p exits its critical section, p reverses its incident edges.
This allows the neighbors of p to get a chance to execute their critical section.
In a self-stabilizing setting, the system may start in an unbalanced configuration
(Definition 9). Starting from this unbalanced configuration, Algorithm BLME
will eventually take the system into a balanced configuration, and the communication
graph becomes acyclic again. From this configuration onwards, Algorithm
BLME always moves from one balanced configuration to another, and the communication
graph will remain acyclic.

When a process p recognizes that (i) it is unbalanced with respect to at least
one of its neighbors, or (ii) it has a neighbor i such that $R.i = 1$ (which implies
that some processes are unbalanced), p executes Action $\mathcal{R}_1$ and sets its reset
marker $R.p$ to 1. p then waits until all its neighbors also reset their R to 1. At
that time, p changes its L variable to 0 (Action $\mathcal{R}_2$). Again, p waits until all its
neighbors reset their L to 0. When that happens, p is balanced again with all
Constants:
  B : $n^2$ if n is even, $n^2 + 1$ if n is odd;
  $\mathcal{N}.p$ : the set of neighbors of process p;
Shared Variables:
  $L.p \in [0..B - 1]$;
  $R.p$ : boolean; reset flag
Local Variables:
  $L\_copy[|\mathcal{N}.p|]$ : array of unbounded integers containing the copy of L of neighbors;
  $R\_copy[|\mathcal{N}.p|]$ : array of boolean containing the copy of R of neighbors;
  CS : boolean flag used to indicate if a process is in the critical section or not;
Functions:
  $MaxL(p) = L.i$ such that $|L\_copy[i] - L.p| = \max\{|L\_copy[j] - L.p|,\ \forall j \in \mathcal{N}.p\}$;
  $(\forall i \in \mathcal{N}.p) : p \prec i \equiv (L.p \triangleleft L\_copy[i]) \vee ((L.p = L\_copy[i] = 0) \wedge (id.p < id.i))$;
  $(\forall i \in \mathcal{N}.p) : p \simeq i \equiv |L.p - L\_copy[i]| < n \wedge ((L.p \neq L\_copy[i]) \vee (L.p = L\_copy[i] = 0 \wedge id.p < id.i))$;
  $(\forall i \in \mathcal{N}.p) : p \not\simeq i \equiv |L.p - L\_copy[i]| > n \vee ((L.p = L\_copy[i]) \wedge (L.p \neq 0))$;
Actions:
  $A_1$ : $(R.p = 0) \wedge (\forall i \in \mathcal{N}.p,\ p \prec i \wedge p \simeq i \wedge R\_copy[i] = 0) \longrightarrow$
      CS = 1;
      execute critical section;
      $L.p = MaxL(p) + 1$;
  $\mathcal{R}_1$ : $(R.p = 0) \wedge (\exists i \in \mathcal{N}.p,\ p \not\simeq i \vee R\_copy[i] = 1 \wedge (L\_copy[i] \neq 0 \vee L.p \neq 0)) \longrightarrow$
      CS = 0;
      $R.p = 1$;
  $\mathcal{R}_2$ : $(R.p = 1) \wedge (L.p \neq 0) \wedge (\forall i \in \mathcal{N}.p,\ R\_copy[i] = 1) \longrightarrow$
      CS = 0;
      $L.p = 0$;
  $\mathcal{R}_3$ : $(R.p = 1) \wedge (L.p = 0) \wedge (\forall i \in \mathcal{N}.p,\ L\_copy[i] = 0) \longrightarrow$
      CS = 0;
      $R.p = 0$;
  $CP_1$ : $(R.p = 0) \wedge (\forall j \in \mathcal{N}.p,\ p \simeq j \wedge R\_copy[j] = 0) \wedge (\exists i \in \mathcal{N}.p,\ i \prec p \wedge L\_copy[i] \neq L.i) \longrightarrow$
      CS = 0;
      $L\_copy[i] = L.i$;
  $CP_2$ : $(R.p = 1) \wedge (L.p \neq 0) \wedge (\exists i \in \mathcal{N}.p,\ R\_copy[i] = 0 \wedge R.i = 1) \longrightarrow$
      CS = 0;
      $R\_copy[i] = R.i$;
  $CP_3$ : $(R.p = 1) \wedge (L.p = 0) \wedge (\exists i \in \mathcal{N}.p,\ L\_copy[i] \neq 0 \wedge L.i = 0) \longrightarrow$
      CS = 0;
      $L\_copy[i] = L.i$;

Algorithm 3.2: Bounded Local Mutual Exclusion (BLME) for process p
its neighbors. p now clears $R.p$ to 0 so that it can eventually execute its critical
section (Action $\mathcal{R}_3$).

The role of Actions $CP_i$ ($i = 1 \ldots 3$) is to keep the local copies of L and R
(i.e., the variables $L\_copy$ and $R\_copy$) up-to-date. These actions are executed
every time the value of the local variables changes.

Note 2. The variable CS in Algorithm BLME is not necessary to solve the local
mutual exclusion problem. We added this in the code in Algorithm 3.2 because 
we would need this to design the transformers in Section 4. 



Correctness of Algorithm BLME. First, we derive a property of the communication
graph G, representing the system running Algorithm BLME, from the
unbounded virtual orientation (as per Definition 5) and the “balanced” property
(Definition 9). This influences the definition of the legitimate configuration
(defined in Definition 10). To prove the correctness of Algorithm BLME, we
first show that starting from a legitimate configuration, any computation always
maintains the legitimacy: the processes execute only Action $A_1$. Finally, we
prove the convergence by defining different possible scenarios in terms of edges
of G and using an “edge migration” scheme.

Lemma 7. Let G be the communication graph representing the system running
Algorithm BLME. If the system is balanced, then G is acyclic.

Property 1. If the system is balanced, then there exists at least one privileged
process and no two neighbors are privileged.

Definition 10 (Legitimate Configuration). A legitimate configuration for
Algorithm BLME (i.e., a configuration which satisfies the legitimacy predicate
$\mathcal{L}_{BLME}$) is a configuration such that the following two conditions hold: (i) The
system is in a balanced configuration. (ii) The value of the reset marker R of all
processes is 0.

Remark 2. In a legitimate state in Algorithm BLME, the processes execute only
Action $A_1$.

Lemma 8. Let e be a computation of Algorithm BLME starting from a legitimate
configuration c (i.e., c satisfies $\mathcal{L}_{BLME}$). Then any configuration reachable
from c in e also satisfies $\mathcal{L}_{BLME}$.

Now, we want to prove the convergence of Algorithm BLME. We first need to
define some covering edge sets of the communication graph G representing the
system running Algorithm BLME:
(1) $M_1 = \{(p, q) \mid p \text{ and } q \text{ are balanced and } (R.p = 0 \wedge R.q = 0)\}$;
(2) $M_2 = \{(p, q) \mid p \text{ and } q \text{ are balanced, and } (R.p = 1 \vee R.q = 1)\}$;
(3) $M_3 = \{(p, q) \mid p \text{ and } q \text{ are unbalanced, and } (R.p = 0 \vee R.q = 0)\}$;
(4) $M_4 = \{(p, q) \mid p \text{ and } q \text{ are unbalanced and } (R.p = 1 \wedge R.q = 1)\}$;
(5) $MT = \bigcup_{i=1}^{4} M_i$; (6) $\forall i \in \{2, 3, 4\} : M_i^0 = \{(p, q) \mid (p, q) \in M_i \wedge (L.p = \ldots$
In a legitimate configuration, all the edges are in $M_1$, i.e., $MT = M_1$, and the
only action enabled at any process is $A_1$ (Remark 2). When the system is in an
illegitimate configuration, $MT \neq M_1$, and from the algorithm, at least one of
Actions $\mathcal{R}_1$, $\mathcal{R}_2$, and $\mathcal{R}_3$ is enabled at some process. So, our obligation now is to
show that starting from such a configuration where $MT \neq M_1$, eventually, all
edges will become part of $M_1$ again. We explain this process by using an “edge
migration” process.
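The four covering sets partition the edges according to two facts: whether the endpoints are balanced, and the values of their reset markers. A minimal sketch of this classification follows; the function names and the representation of an edge as a triple (balanced flag, R of one endpoint, R of the other) are assumptions made for the example.

```python
def edge_class(is_balanced, R_p, R_q):
    """Return which of the sets M1..M4 the edge (p, q) belongs to."""
    if is_balanced:
        return "M1" if (R_p == 0 and R_q == 0) else "M2"
    return "M3" if (R_p == 0 or R_q == 0) else "M4"


def all_edges_in_M1(edges):
    """True when every edge, given as (is_balanced, R_p, R_q), is in M1."""
    return all(edge_class(*e) == "M1" for e in edges)
```

Every combination of inputs falls into exactly one class, so the four sets cover all edges, and a configuration is legitimate exactly when `all_edges_in_M1` holds.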

We will first show that starting from an illegitimate configuration, eventually,
one of the processes would be able to execute one of Actions $\mathcal{R}_1$, $\mathcal{R}_2$, and $\mathcal{R}_3$,
meaning that the process of convergence would eventually start. We then prove
the convergence process using the edge migration process as follows (shown in
Figure 2): (a) Every edge of M^ eventually becomes a member of Mf{i ^ 0) or 
(h) Every edge of Mf{i ^ 1) eventually moves to (c) Every edge 

of Af (0^0) eventually moves to set $M_{1r}$, which is defined as follows: $M_{1r}$ contains
all edges $(p, q)$ such that p and q are balanced, $R.p = 0$, $R.q = 0$, the edge $(p, q)$
was originally in a set $M_i$ ($i \neq 1$), and $(L.p \neq 0 \vee L.q \neq 0)$.

Note that $M_{1r}$ is actually the same as $M_1$ except that the set $M_{1r}$ is created only
due to a computation starting from an illegitimate state. So, once the system is
back to a legitimate configuration, $M_{1r}$ becomes the current $M_1$.

Lemma 9. Let e be a computation of Algorithm BLME starting from an illegitimate
configuration c, i.e., where $MT \neq M_1$. Then eventually one of the actions
in $\{\mathcal{R}_1, \mathcal{R}_2, \mathcal{R}_3\}$ will be executed in e.




Fig. 2. Edge Migration 
Lemma 10. Let e be a computation of Algorithm BLME starting in a configuration
c where $MT \neq M_1$. Then eventually, all edges of G will belong to $M_1$.

Remark 3. The liveness (Lemma 4) and $(n - 1)$-fairness (Lemma 5) results proven
for Algorithm ULME also hold for Algorithm BLME because the orientation of
edges is changed in an identical manner in the two algorithms.

Theorem 2. Algorithm BLME is a self-stabilizing local mutual exclusion algorithm
under an unfair daemon having the fairness index $(n - 1)$ and the service
time

4 Daemon Refinement 

In this section, we propose an application of the local mutual exclusion algorithms
introduced in the previous section, which we denote here as
Algorithm $\mathcal{A}_{lme}$. We use those
algorithms to transform self-stabilizing algorithms proven under some weaker
daemon (e.g., the central or $k$-fair daemon), called weaker algorithms, into
self-stabilizing algorithms which would work in the presence of the stronger daemon
(i.e., the distributed unfair daemon). From now on, we refer to these algorithms
as the stronger algorithms.

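Transformation Description. The transformation technique is based on a particular
composition scheme between the actions of the local mutual exclusion
algorithm ($\mathcal{A}_{lme}$) and a weaker algorithm ($W$). Assume that the $\mathcal{A}_{lme}$ and $W$
algorithms have m and n actions, respectively. Let $a_{lme}g_i$ (respectively, $wg_i$) and
$a_{lme}s_i$ (respectively, $ws_i$) represent the guard and statement, respectively, of
the $i$-th action of the $\mathcal{A}_{lme}$ (respectively, $W$) algorithm. The composed algorithm, $S$,
consists of the following actions:

- $\forall i \in [1..m], \forall j \in [1..n] : \langle a_{lme}g_i \rangle \wedge \langle wg_j \rangle \longrightarrow \langle a_{lme}s_i \rangle$;
  if CS = 1 then $\langle ws_j \rangle$

- $\forall i \in [1..m] : \langle a_{lme}g_i \rangle \wedge \neg wg_1 \wedge \ldots \wedge \neg wg_n \longrightarrow \langle a_{lme}s_i \rangle$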

Note 3. The actions of the weaker algorithm are executed only when CS = 1, 
i.e., when the process is allowed to execute its critical section. The variable CS 
is used only to design the transformers (see Notes 1 and 2). 
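The composition can be sketched as a function on lists of guarded actions. This is one possible reading only: it assumes that a composed action first runs the mutual exclusion statement and then runs the weaker statement if CS = 1, and that a mutual exclusion action is also available on its own when no weaker guard holds. The representation of actions as (guard, statement) pairs over a mutable state dictionary, and the names `compose` and `step`, are hypothetical.

```python
def compose(lme_actions, weak_actions):
    """Compose (guard, statement) pairs of the two algorithms.

    Guards and statements are functions of a mutable state dictionary
    that contains at least the key "CS".
    """
    composed = []
    for lme_guard, lme_stmt in lme_actions:
        for w_guard, w_stmt in weak_actions:
            def guard(s, g1=lme_guard, g2=w_guard):
                return g1(s) and g2(s)

            def stmt(s, s1=lme_stmt, s2=w_stmt):
                s1(s)
                if s["CS"] == 1:  # weaker action only inside the critical section
                    s2(s)

            composed.append((guard, stmt))

        def idle_guard(s, g1=lme_guard):
            # No weaker guard holds: run the mutual exclusion action alone.
            return g1(s) and not any(g(s) for g, _ in weak_actions)

        composed.append((idle_guard, lme_stmt))
    return composed


def step(composed, state):
    """Execute the first enabled composed action; report whether one ran."""
    for guard, stmt in composed:
        if guard(state):
            stmt(state)
            return True
    return False
```

With m mutual exclusion actions and n weaker actions, this reading yields m(n + 1) composed actions, and a weaker statement can only change the state in a step that leaves CS equal to 1.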

Lemma 11. The composed algorithm is a self-stabilizing local mutual exclusion
algorithm according to the specification $\mathcal{SP}_c$ and its fairness index is $(n - 1)$.

4.1 Central Daemon to Distributed Daemon 

In this section, we show that we can transform a self-stabilizing algorithm working
under a central daemon into a self-stabilizing algorithm under a distributed
daemon.

Assume that, in the composition described in Section 4, the weaker algorithm,
$W_c$, is a self-stabilizing algorithm for the specification $\mathcal{SP}_c$ which works under
a central daemon, and $S$ represents the composed algorithm.

Theorem 3. Algorithm $S$ is self-stabilizing for the specification $\mathcal{SP}_c$ under a
distributed daemon.



4.2 Fair Daemon to Distributed Daemon 

We propose another application for the local mutual exclusion: the transformation
of a self-stabilizing algorithm under a $k$-fair daemon into a self-stabilizing
algorithm under a distributed daemon.

Assume that, in the composition described in Section 4, the weaker algorithm,
$W_k$, is a self-stabilizing algorithm for the specification $\mathcal{SP}_k$ under a
$k$-fair daemon ($\forall k > (n - 1)$), and $S$ is the composed algorithm.

Theorem 4. Algorithm $S$ is self-stabilizing for the specification $\mathcal{SP}_k$ under any
distributed daemon.

5 Conclusions 

We presented a transformation technique to transform self-stabilizing algorithms
under weak daemons into algorithms which maintain the self-stabilization property
and also work under a stronger daemon, namely any arbitrary distributed
daemon (including the unfair daemon). The key tool in designing the above is
a self-stabilizing local mutual exclusion algorithm, which by itself is a major
contribution of this work. One of the two local mutual exclusion algorithms
presented in this paper is a bounded memory self-stabilizing solution which
is proven under the assumption of an unfair daemon in the read/write model
[DIM93] with the semantics definition of [AN99]. Another nice feature of our
local mutual exclusion algorithms is that they achieve a bounded service time.

Since our protocols work under the read/write atomicity model, they can 
easily be extended to the message-passing environment. 

Another possible extension for our bounded solution is a generalization of the 
unison problem defined in [CFG92]. In Algorithm BLME, the difference between 
the values of the local variables of any two neighboring processes is bounded. 
Therefore, each process can start a phase with the number equal to the value 
of its local variable L. The main features of this extension are that the phase 
difference between two processes is bounded, and no two neighboring processes 
are in the same phase. 

Acknowledgment The authors would like to thank the referees whose com- 
ments helped improve the presentation of the paper. 
More Lower Bounds for Weak Sense of Direction
The Case of Regular Graphs 
Abstract. A graph G with n vertices and maximum degree $\Delta_G$ cannot be given
weak sense of direction using fewer than $\Delta_G$ colours. It is known that n colours
are always sufficient, and it was conjectured that just $\Delta_G + 1$ are really needed,
that is, one more colour is sufficient. Nonetheless, it has just been shown [2]
that for sufficiently large n there are graphs requiring $\omega(n / \log n)$ more colours
than $\Delta_G$. In this paper, using recent results in asymptotic graph enumeration,
we show not only that (somewhat surprisingly) the same bound holds for regular
graphs, but also that it can be improved to $\Omega(n \log\log n / \log n)$. We also show
that $\Omega(d_G \sqrt{\log\log d_G})$ colours are necessary, where $d_G$ is the degree of G.



1 Introduction 

Sense of direction and weak sense of direction [5] are properties of global consistency of 
the colouring of a network that can be used to reduce the complexity of many distributed 
algorithms [4]. Although there are polynomial algorithms for checking whether a given 
coloured graph has (weak) sense of direction [1], the polynomial bounds are rather 
high, and, moreover, there are no results (besides the obvious membership to NP) about 
finding a colouring that is a (weak) sense of direction using the smallest number of 
colours. 

The number of vertices n in a graph G is a trivial upper bound for the number of
colours, and the maximum degree $\Delta_G$ is a trivial lower bound. However, $\Delta_G$ was essentially
the only known lower bound; the difficulty of proving that $\Delta_G + 2$ colours were
necessary for some graph prompted the conjecture that $\Delta_G + 1$ colours were always
sufficient [6]; the conjecture is of course of particular interest for regular graphs. Recently
the authors proved that there are graphs requiring $\omega(n / \log n)$ additional colours [2],
but the proof makes intensive use of random graphs of high degree: therefore, an extension of
the proof to regular graphs appears difficult (as the theory of random regular graphs
mainly considers fixed or slowly growing degrees).

In this paper, using a recent result in graph asymptotic enumeration [8], we bypass
this problem and show that $\Omega(n \log\log n / \log n)$ additional colours are necessary
to give weak sense of direction to all regular graphs. This result strongly disproves
the original conjecture, even when restricted to regular graphs. We also show
that if the overall number of colours used is dependent on the degree $d_G$, it must be
$\Omega(d_G \sqrt{\log\log d_G})$.

We remark that even if the main proof of this paper is a rather straightforward count- 
ing argument, it is based on an asymptotic estimate of the number of regular graphs 
enjoying a suitable property, and this estimate requires rather involved computations. 

A consequence of our proof is that almost all regular graphs in a certain range of 
degrees (see Theorem 3) have diameter two. There are presently no published results of 
this kind in the literature (see [10]), although Krivelevitch, Sudakov, Vu and Wormald 
are preparing a paper on these issues that covers a wider degree range [9]. However, 
we believe that the techniques used in the proof can be fruitfully applied to many other 
properties of random regular graphs. 

2 Definitions 

A (directed) graph G is given by a set $V = [n] = \{0, 1, \ldots, n - 1\}$ of n vertices and
a set $A \subseteq V \times V$ of arcs (note that the graphs in this paper are not considered up to
isomorphism; using a common terminology, they are labelled). We write $P[x, y] \subseteq A^*$
for the set of paths from vertex x to vertex y. A graph is symmetric if $(y, x)$ is an
arc whenever $(x, y)$ is.

In this paper we shall always manipulate symmetric loopless directed graphs, which 
are really nothing but undirected simple graphs (an edge is identified with a pair of op- 
posite arcs). However, the directed symmetric representation allows us to handle more 
easily the notion of weak sense of direction and the related proofs. In turn, when using 
asymptotic enumeration results we shall confuse a symmetric loopless directed graph 
with its undirected simple counterpart. 

The (average) degree $d_G$ of a graph G is $|A| / |V|$ (or, in the undirected interpretation,
twice the number of edges divided by the number of vertices). Of course, if G is regular
(i.e., all vertices have the same number of incoming and outgoing arcs) then $d_G$ is the
(in- and out-)degree of every vertex, and one says that G is $d_G$-regular.

A colouring of a graph G is a function $\lambda : A \to \mathcal{L}$, where $\mathcal{L}$ is a finite set of
colours; the map $\lambda^* : A^* \to \mathcal{L}^*$ is defined by $\lambda^*(a_1 a_2 \cdots a_p) = \lambda(a_1)\lambda(a_2) \cdots \lambda(a_p)$.
We write $\mathcal{L}_x = \{\lambda((x, y)) \mid (x, y) \in A\}$ for the set of colours that x assigns to its
outgoing arcs.

Given a graph G coloured by $\lambda$, let

$$L = \bigcup_{x, y \in V} \{\lambda^*(\pi) \mid \pi \in P[x, y]\}$$

be the set of all strings that colour paths of G.

A local naming for G is a family of injective functions $\beta = \{\beta_x : V \to \mathcal{N}\}_{x \in V}$,
with $\mathcal{N}$ a finite set, called the name space. Intuitively, each vertex x of G gives to each
other vertex y a name $\beta_x(y)$ taken from the name space.

Given a coloured graph endowed with a local naming, a function $f : L \to \mathcal{N}$ is a
coding function iff

$$\forall x, y \in V \quad \forall \pi \in P[x, y] \qquad f(\lambda^*(\pi)) = \beta_x(y).$$
A coding function translates the colouring of the path along which two vertices x, y are
connected into the name that x gives to y. A colouring $\lambda$ is a weak sense of direction for
a graph G iff for some local naming there is a coding function. We shall also say that
a coloured graph has weak sense of direction, or that $\lambda$ gives weak sense of direction to
G.

3 Representing Regular Graphs Using Weak Sense of Direction 

A coding function f represents compactly a great deal of information about a graph,
because f tells whether two paths with the same source have the same target. For instance,
suppose that we want to exploit (naively) this property to code compactly a
(strongly) connected regular graph G with weak sense of direction. Assume without
loss of generality that $\beta_0(x) = x$ for all vertices x, that is, vertex 0 locally gives to
all other vertices their “real names”. To code G, first specify for each vertex the set of
colours of outgoing arcs. Then, give the values of f on every string of colours having
length at most $D + 1$, where D is the diameter of G.

To rebuild $G$ from the above data, we proceed as follows: first of all we compute
the targets of the arcs out of 0 using $f$ on strings of length one, thus obtaining the set
of coloured paths of length one going out of 0. Then, since we know the colours of the
arcs going out of the targets of such paths, we can build the set of coloured paths of
length two out of 0, and compute their targets using $f$ on strings of length two, and so
on. Thus, we will eventually discover all arcs of $G$, using just the values of $f$ on paths
of length $D + 1$ at most. Unfortunately this naive attempt is too rough, even for $D = 2$,
so we shall use a slightly more sophisticated approach.
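
The naive rebuilding procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `rebuild_arcs`, `out_colours` and `f` are hypothetical, the coding function is passed as a callable on tuples of colours, and the example input (the 5-cycle coloured as a Cayley graph of Z_5) is chosen here for convenience and does not come from the text.

```python
def rebuild_arcs(out_colours, f, diameter):
    """Rebuild the coloured arcs of G, exploring paths that leave vertex 0.

    out_colours[x] is the set of colours on the arcs leaving x, and f maps a
    tuple of colours (a string colouring a path out of 0) to the target of
    that path. Only strings of length at most diameter + 1 are passed to f.
    """
    arcs = set()
    frontier = {()}  # colour strings of paths leaving 0; () is the empty path
    for _ in range(diameter + 1):
        longer = set()
        for w in frontier:
            x = f(w) if w else 0  # target of the path coloured w
            for c in out_colours[x]:
                arcs.add((x, c, f(w + (c,))))
                longer.add(w + (c,))
        frontier = longer
    return arcs


# Illustrative input (an assumption): the 5-cycle on Z_5, colouring the arc
# x -> x + c (mod 5) by c in {1, 4}, so that f(w) = sum(w) mod 5.
n = 5
out_colours = {x: {1, 4} for x in range(n)}
f = lambda w: sum(w) % n
arcs = rebuild_arcs(out_colours, f, diameter=2)
```

The number of strings explored grows exponentially with the diameter, which is one way to see why the naive description is too expensive.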

Let $\mathcal{G}(n, k)$ be the class of all symmetric $k$-regular graphs with $n$ vertices that enjoy
the following property, which we shall call property $A_{3/2}$:

If $x_1$, $x_2$, $x_3$ are three distinct vertices such that $x_2$ and $x_3$ are adjacent, and
$x_1$ is not adjacent to $x_2$ and not adjacent to $x_3$, then there exists a vertex
$z \notin \{x_1, x_2, x_3\}$ that is adjacent to $x_1$, $x_2$ and $x_3$.

Note that property $A_{3/2}$ is weaker than property $A_3$ of [2]; as we shall see, for suitable $d$
almost all $d$-regular graphs enjoy property $A_{3/2}$, and nonetheless graphs satisfying $A_{3/2}$
can be coded compactly. Intuitively, for regular graphs $A_{3/2}$ is a connectivity property
slightly stronger than having diameter two, since given any pair of vertices $x$, $y$ we
can choose a vertex $z$ adjacent to $y$ (the existence of $z$ is ensured by regularity) and
apply $A_{3/2}$, getting a vertex that, in particular, is adjacent both to $x$ and to $y$. The same
argument shows also that a regular graph satisfying $A_{3/2}$ is connected.
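
To make the property concrete, here is a brute-force check of the condition on small undirected graphs. It is a sketch only: the function name `has_property_a32` and the adjacency-dictionary representation are assumptions made for illustration, and the two example graphs are chosen here, not taken from the text.

```python
def has_property_a32(adj):
    """adj maps each vertex to the set of its neighbours (symmetric graph)."""
    for x2 in adj:
        for x3 in adj[x2]:                      # x2 and x3 are adjacent
            for x1 in adj:
                if x1 in (x2, x3) or x2 in adj[x1] or x3 in adj[x1]:
                    continue                    # x1 must miss both x2 and x3
                if not any(z not in (x1, x2, x3) and {x1, x2, x3} <= adj[z]
                           for z in adj):
                    return False                # no common neighbour z exists
    return True


# The 5-cycle fails the property; the complement of the 6-cycle satisfies it.
cycle5 = {x: {(x - 1) % 5, (x + 1) % 5} for x in range(5)}
co_cycle6 = {x: {(x + 2) % 6, (x + 3) % 6, (x + 4) % 6} for x in range(6)}
```

In the 5-cycle, the triple $x_1 = 3$, $x_2 = 0$, $x_3 = 1$ has no common neighbour, so the check fails there.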

Lemma 1. Let $G$ be a graph satisfying $A_{3/2}$. Let $\lambda$ be a sense of direction for $G$ with
name space $\mathcal{N}$, coding function $f$ and local naming $\beta$. Assume without loss of generality
that $\mathcal{N} \supseteq [n]$ and $\beta_0(x) = x$ for all vertices $x$. Then, $(x, y)$ is an arc iff one of the
following holds:

^ In [5] a slightly different definition is given, in which the empty string is not part of $L$. The
results of this paper are not affected by this difference.
- $x = 0$ and $y = f(a)$ for some colour $a \in \mathcal{L}_0$;

- $x = f(a)$ and $y = f(ab)$ for some $a \in \mathcal{L}_0$ and $b \in \mathcal{L}_{f(a)}$;

- $x = f(ab)$ and $y = f(ac)$ for some $a \in \mathcal{L}_0$, $b, c \in \mathcal{L}_{f(a)}$, provided that there
exists $d \in \mathcal{L}_{f(ab)}$ such that $f(bd) = f(c)$.

Proof. If $a$ colours an arc going out of 0, the target of the arc is $f(a)$. Moreover, if there
is an arc with colour $b$ going out of $f(a)$, the target of such arc is $f(ab)$. For the third
case, $ab$ and $ac$ colour paths going out of 0, whereas $bd$ and $c$ colour paths going out of
$f(a)$. Since $f(bd) = f(c)$, the latter paths must have the same target; hence the path
from 0 coloured $ac$ has the same target as the path coloured $abd$. Therefore, there must
be an arc, coloured $d$, from $f(ab)$ to $f(ac)$, as in Fig. 1.




For the other side of the implication, consider an arc $(x, y)$ of $G$. We have three
cases:

- if $x = 0$, then it must correspond to an arc of the form $(0, f(a))$ for some $a \in \mathcal{L}_0$;

- if $x$ is an outneighbour of 0, then $x = f(a)$ for some $a \in \mathcal{L}_0$, and the arc
corresponds to an arc of the form $(f(a), f(ab))$ for some $b \in \mathcal{L}_{f(a)}$;

- finally, assume that $x, y \neq 0$ and that moreover $x$ and $y$ are not outneighbours of 0.
By property $A_{3/2}$, there exists a vertex $z$ adjacent to $x$, $y$ and 0. Let $a$ be the colour
of the arc going from 0 to $z$, $b$ be the colour of the arc going from $z$ to $x$, $c$ be the
colour of the arc going from $z$ to $y$, and $d$ be the colour of the arc from $x$ to $y$. We
have $f(ab) = \beta_0(x) = x$, $f(ac) = \beta_0(y) = y$ and $f(bd) = \beta_z(y) = f(c)$ (see
Fig. 1). □
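
The three conditions of Lemma 1 only ever evaluate $f$ on strings of length one or two. The Python sketch below collects the arcs they certify; it is an illustration under explicit assumptions, not the authors' procedure. The names `arcs_from_lemma1`, `out_colours` and `f` are hypothetical. Because the third case of the proof is stated for endpoints that both lie outside the out-neighbourhood of 0, the sketch also adds the reverse of every arc it finds, which relies on the graph being symmetric and is an assumption of the sketch. The example graph (the complement of the 6-cycle, coloured as a Cayley graph of Z_6) is chosen here for illustration.

```python
def arcs_from_lemma1(out_colours, f):
    """Collect the arcs certified by the three conditions of Lemma 1."""
    arcs = set()
    for a in out_colours[0]:
        z = f((a,))
        arcs.add((0, z))                      # first condition
        for b in out_colours[z]:
            x = f((a, b))
            arcs.add((z, x))                  # second condition
            for c in out_colours[z]:
                # third condition: some d out of x with f(bd) = f(c)
                if any(f((b, d)) == f((c,)) for d in out_colours[x]):
                    arcs.add((x, f((a, c))))
    # Assumption of this sketch: G is symmetric, so reversed arcs are arcs too.
    return arcs | {(y, x) for (x, y) in arcs}


# Illustrative input (an assumption): the complement of the 6-cycle, seen as
# the Cayley graph of Z_6 with colours {2, 3, 4} and f(w) = sum(w) mod 6.
n = 6
out_colours = {x: {2, 3, 4} for x in range(n)}
f = lambda w: sum(w) % n
arcs = arcs_from_lemma1(out_colours, f)
```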

Following the line of [2], we can use Lemma 1 to code compactly regular graphs as 
follows: 

Theorem 1. Let $c = c(G) \in \mathbf{N}$ be such that every $k$-regular graph $G$ with $n$ vertices
can be given weak sense of direction using no more than $c(G)$ colours. Then every graph
in $\mathcal{G}(n, k)$ can be described^ using $O(cn + c^2 \log n)$ bits.

Proof. Let $G \in \mathcal{G}(n, k)$ have weak sense of direction with colouring $\lambda$, name space $\mathcal{N}$,
local naming $\beta$ and coding function $f$. Assume without loss of generality that $\mathcal{N} \supseteq [n]$
and $\beta_0(x) = x$ for every vertex $x$. Describe $G$ as follows:

^ From now on, we shall sometimes omit the explicit dependence of functions on their argument,
when the latter is clear from the context, thus writing $c$ instead of $c(G)$, $d$ instead of
$d(n)$ and so on.
1. give the number of colours $c$;

2. for every vertex $x$, use $c$ bits to describe the set $\mathcal{L}_x$;

3. give the values of $f$ on every string of length one or two.

The first data require $\lceil \log c \rceil$ bits, the second one $cn$ bits and the third one
$(c + c^2) \lceil 2 \log n \rceil$ (as we mentioned, $\lceil 2 \log n \rceil$ bits are sufficient to specify a name).
From the above description, $G$ can be recovered using Lemma 1. □
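
As a worked instance of this count, the short Python sketch below adds up the three contributions. The function name `description_bits` is hypothetical, logarithms are taken in base 2, and the sketch counts all $c + c^2$ strings of length one or two over $c$ colours, whether or not each of them actually colours a path.

```python
from math import ceil, log2


def description_bits(n, c):
    """Bits used by the three-part description of a graph with c colours."""
    colour_count = ceil(log2(c))                     # item 1: the number c
    colour_sets = c * n                              # item 2: c bits per vertex
    coding_values = (c + c * c) * ceil(2 * log2(n))  # item 3: one name per string
    return colour_count + colour_sets + coding_values


bits = description_bits(n=16, c=4)  # 2 + 64 + 20 * 8
```

For $n = 16$ and $c = 4$ this gives $2 + 64 + 160 = 226$ bits, in line with the $O(cn + c^2 \log n)$ bound.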

4 A Result about Graph Enumeration 

The inspiration for this paper came out of a recent breakthrough by McKay and Wormald 
in asymptotic graph enumeration: 

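Theorem 2 ([8]). Let $d = d(n)$ and $\delta_j = \delta_j(n)$, $0 \le j < n$, be such that
$\min\{d, n - d - 1\} > cn/\log n$ for some $c > 2/3$, $\sum_{j=0}^{n-1} \delta_j = 0$, $\delta_j = O(1)$ uniformly over $j$, $d + \delta_j$
is an integer for $0 \le j < n$ and $dn$ is an even integer. Then the number of graphs with
$n$ vertices and local degrees $d + \delta_0, d + \delta_1, \ldots, d + \delta_{n-1}$ is asymptotic to $\alpha M(n, d)$,
where

\[ M(n, d) = \left( 2\pi n \, \frac{d^{\,d+1} (n - 1 - d)^{\,n-d}}{(n - 1)^{\,n+1}} \right)^{-n/2} \]

and $\gamma_1 / n^{\varepsilon} < \alpha < \gamma_2$ for suitable positive constants $\varepsilon$, $\gamma_1$ and $\gamma_2$.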

In other words, under the given hypotheses the order of magnitude depends essentially
on the average degree only, and not on the specific degrees (but note that $\alpha$ in general
will depend on $n$, on $d$ and on the $\delta_j$'s). The original result of McKay and Wormald is
much more powerful, as it provides a precise asymptotic estimate for much more varied
$\delta_j$'s, but the simplification above is sufficient for our purposes.

The above theorem has the following consequence, whose (complex) proof is de- 
ferred to the last section: 

Theorem 3. Let $d = o(n)$ satisfy the hypotheses of Theorem 2. Then almost all $d$-regular
graphs satisfy $A_{3/2}$.

The statement "almost all $d$-regular graphs satisfy $P$" means that the number of $d$-regular
graphs of order $n$ enjoying $P$ divided by the number of all $d$-regular graphs of
order $n$ goes to 1 as $n \to \infty$. Equivalently, if we consider the standard model of $d$-regular
random graphs [3] in which all $d$-regular graphs of order $n$ are equiprobable,
we can say that the probability that a random graph satisfies $P$ goes to 1 as $n \to \infty$.

Since under the given hypotheses the number of all $d$-regular graphs is asymptotic
to the number of $d$-regular graphs satisfying $A_{3/2}$, we can use Theorem 2 to get an
asymptotic estimate of the size of the class $\mathcal{G}(n, d)$, and thus a lower bound on the
number of bits that are necessary to describe a graph belonging to it.

Theorem 4. Let $d = o(n)$ satisfy the hypotheses of Theorem 2. Then, the number of
bits required to describe a graph in $\mathcal{G}(n, d)$ is $\Theta(nd \log(n/d))$.
Proof. Since by Theorem 3 the number of graphs in $\mathcal{G}(n, d)$ is asymptotic to the number
of $d$-regular graphs, Theorem 2 tells us that the number of bits required is asymptotic
to $\log[\alpha M(n, d)]$. If we expand the latter expression killing all terms that are $O(nd)$ we
obtain

\[ \log[\alpha M(n, d)] = -\frac{n}{2}(1 + d) \log\frac{d}{n - 1} - \frac{n}{2}(n - d) \log\left(1 - \frac{d}{n - 1}\right) + O(nd) \]
\[ = \Theta\left(nd \log\frac{n}{d}\right) + \frac{n}{2}(n - d)\, O\left(\frac{d}{n}\right) + O(nd) = \Theta\left(nd \log\frac{n}{d}\right). \quad □ \]



5 The Main Theorem 

We finally put together the upper and lower bounds we obtained: 

Theorem 5. If $g(n) = o(n \log\log n / \log n)$, it is impossible to give (weak) sense of
direction to all regular graphs using $d_G + g(n)$ colours. Moreover, it is impossible to
give (weak) sense of direction to all regular graphs using $o(d_G \sqrt{\log\log d_G})$ colours.^

Proof. If $g = O(n/\log n)$, take any $d = \Theta(n/\log n)$ satisfying the hypotheses of
Theorem 2 and note that by Theorem 1 $O(n^2/\log n)$ bits would be sufficient to describe
a graph in $\mathcal{G}(n, d)$, but by Theorem 4 $\Theta(n^2 \log\log n / \log n)$ are required. Otherwise,
we can write $g = n f(n)/\log n$, with $f(n) = o(\log\log n)$, and take $d = g$. In
this case $O(n^2 f(n)^2 / \log n) = o(n^2 f(n) \log\log n / \log n)$ bits would be sufficient, but
$\Theta(n^2 f(n) \log\log n / \log n)$ are required.

Finally, if $h(m) = o(m \sqrt{\log\log m})$ as $m \to \infty$, take any $d = \Theta(n/\log n)$; in
this case $O(h(d)^2 \log n) = o(n^2 \log\log n / \log n)$ bits would be sufficient, but again
$\Theta(n^2 \log\log n / \log n)$ are necessary. □

6 A Proof of Theorem 3 — Part I 

To prove that almost all $d$-regular graphs satisfy $A_{3/2}$, we show that almost no $d$-regular
graph satisfies $\neg A_{3/2}$. The interesting feature of a $d$-regular graph $G$ with $n$ vertices that
does not satisfy $A_{3/2}$ is that it has a rather precise structure, displayed in Fig. 2, where
$A_{3/2}$ does not work on $x_1$, $x_2$ and $x_3$ (in Fig. 2 we draw only edges incident on $x_1$, $x_2$
and $x_3$).

The vertices of $G$ are partitioned into seven sets, depending on their adjacency relations
with the three vertices on which $A_{3/2}$ does not work. If we strip $x_1$, $x_2$ and $x_3$ we
obtain a new "stripped graph" with $n - 3$ vertices and a rather precise degree assignment:
clearly all vertices in $V_\emptyset$ will have degree $d$, all vertices in the sets $V_i$ will have
degree $d - 1$ and all vertices in the sets $V_{ij}$ will have degree $d - 2$. The key point is that
such a degree structure still falls under the scope of Theorem 2. Since, as we will show,
the average degree $d'$ of a stripped graph is independent of the actual cardinalities of the

^ That is, for every function $h(m) = o(m \sqrt{\log\log m})$ there is a graph $G$ such that $h(d_G)$ colours
are not sufficient.
$V$'s, we may hope to bound the number of counterexamples to $A_{3/2}$ using $M(n - 3, d')$
to bound carefully the number of stripped graphs. To this goal, we work backwards and
define a suitable kind of graph that can be enriched with three vertices so as to obtain a
$d$-regular counterexample to $A_{3/2}$ of order $n$.

An $(n, d)$-stripped graph is a graph $S$ with $n - 3$ vertices, endowed with a vertex-colouring
function $\pi : [n - 3] \to 2^{\{1,2,3\}}$ and satisfying the following conditions: let
us write $V_1$ for the set of vertices coloured by $\{1\}$, $V_{12}$ for the set coloured by $\{1, 2\}$
and so on (formally, $V_X = \pi^{-1}(X)$ for $X \subseteq \{1, 2, 3\}$), and finally let $v_X = |V_X|$; we
require that

\[ \deg(x) + |\pi(x)| = d \quad \text{for every vertex } x \text{ of } S \]
\[ v_{123} = 0 \]
\[ v_1 + v_{12} + v_{13} = d \]
\[ v_2 + v_{12} + v_{23} = d - 1 \]
\[ v_3 + v_{13} + v_{23} = d - 1 \]

The rationale behind the previous equalities is immediate, looking at Fig. 2. Note that,
as a consequence,

\[ v_1 + v_2 + v_3 + 2(v_{12} + v_{13} + v_{23}) = 3d - 2. \]
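
The following Python sketch illustrates the colouring on a tiny example. It starts from a regular graph and a triple on which $A_{3/2}$ fails, colours every remaining vertex by the set of triple members it is adjacent to, and lets one check the equalities above. The name `strip_triple` and the choice of the 5-cycle with the triple (3, 0, 1) are assumptions made for illustration only.

```python
def strip_triple(adj, x1, x2, x3):
    """Colour each remaining vertex by the indices of the triple it touches."""
    triple = (x1, x2, x3)
    pi = {x: frozenset(i for i, t in enumerate(triple, start=1) if t in adj[x])
          for x in adj if x not in triple}
    counts = {}
    for colours in pi.values():
        counts[colours] = counts.get(colours, 0) + 1
    return pi, counts


# The 5-cycle is 2-regular; x2 = 0 and x3 = 1 are adjacent, x1 = 3 misses both.
cycle = {x: {(x - 1) % 5, (x + 1) % 5} for x in range(5)}
pi, counts = strip_triple(cycle, 3, 0, 1)
v = lambda *colours: counts.get(frozenset(colours), 0)  # v(1, 2) is v_12
d = 2
```

Here vertex 2 receives colour $\{1, 3\}$ and vertex 4 receives colour $\{1, 2\}$, so $v_{12} = v_{13} = 1$ and all other counts are zero.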

We can associate to each $(n, d)$-stripped graph $S$ a $d$-regular counterexample to $A_{3/2}$
with $n$ vertices in the following manner: we add three new vertices $n - 3$, $n - 2$ and $n - 1$
to $S$, and connect vertex $n - j$ to vertex $x < n - 3$ if and only if $j \in \pi(x)$. Moreover,
vertex $n - 2$ is adjacent to vertex $n - 3$. It is straightforward to observe that the graph
constructed as above is $d$-regular (because of the condition $\deg(x) + |\pi(x)| = d$).
Moreover, vertices $n - 3$, $n - 2$ and $n - 1$ fail to satisfy property $A_{3/2}$; conversely,
every $d$-regular graph of size $n$ whose last three vertices fail to satisfy $A_{3/2}$ may be
obtained from a suitable $(n, d)$-stripped graph using the above construction. Finally, we
remark that we can choose for each counterexample to $A_{3/2}$ a relabelling that exactly
exchanges the labels of the (say, lexicographically first) three vertices that break $A_{3/2}$
with those of the last three vertices. As a result, we have that at most $n^3$ counterexamples to
$A_{3/2}$ correspond to an $(n, d)$-stripped graph. Our next goal is thus to bound the number
of $(n, d)$-stripped graphs, and to this purpose we state some simple properties of the
variables $v$ that are easily derivable from the linear system above:

Figure 2. A generic counterexample to $A_{3/2}$.

Lemma 2. Let $S$, $\pi$ define an $(n, d)$-stripped graph, $k = n - 3 - v_\emptyset$ and $s = v_{12} + v_{13}$;
then:

1. the average degree of $S$ is $d' = d - (3d - 2)/(n - 3)$;

2. the following (in)equalities hold:

\[ v_{12} + v_{13} + v_{23} = 3d - 2 - k \quad (1) \]
\[ v_2 = k - 2d + v_{13} + 1 \quad (2) \]
\[ v_1 + v_2 = k - d - v_{12} + 1 \quad (3) \]
\[ \lceil (3d - 2)/2 \rceil \le k \le 3d - 2 \quad (4) \]
\[ \max(0, 4d - 2k - 2) \le s \le \min(d, 3d - 2 - k) \quad (5) \]
\[ \max(0, 2d - k - 1) \le v_{12} \le \min(s, k - 2d + s + 1). \quad (6) \]

Proof. 1. The average degree of $S$ is:

\[ d' = \frac{v_\emptyset d + (v_1 + v_2 + v_3)(d - 1) + (v_{12} + v_{13} + v_{23})(d - 2)}{n - 3} \]
\[ = d - \frac{v_1 + v_2 + v_3 + 2(v_{12} + v_{13} + v_{23})}{n - 3} = d - \frac{3d - 2}{n - 3}. \]

2. Equation (1) directly follows from the constraints; hence we have $v_2 = d - v_{12} - v_{23} - 1 = k - 2d + v_{13} + 1$,
proving (2). Moreover, $v_1 = d - v_{12} - v_{13}$, hence
$v_1 + v_2 = k - d - v_{12} + 1$, which is (3). For proving inequality (4), observe that
$k + v_{12} + v_{13} + v_{23} = 3d - 2$ implies $k \le 3d - 2$; moreover, since $2(v_{12} + v_{13} + v_{23}) = 3d - 2 - (v_1 + v_2 + v_3)$,
we have $k = (v_1 + v_2 + v_3 + 3d - 2)/2 \ge (3d - 2)/2$. For (5), first
recall that $v_2 = k - 2d + v_{13} + 1$, and similarly $v_3 = k - 2d + v_{12} + 1$; the nonnegativity
constraints on $v_2$ and $v_3$ give $v_{12} \ge 2d - 1 - k$ (which is the only nontrivial lower bound
for the remaining pair of inequalities) and $v_{13} \ge 2d - 1 - k$, hence the lower bound
on $s = v_{12} + v_{13}$. On the other hand, $s = 3d - 2 - k - v_{23} \le 3d - 2 - k$ and also
$s = v_{12} + v_{13} = d - v_1 \le d$. Finally, for (6), we have $v_{12} = s - v_{13} \le s$, and
moreover, since $v_{13} \ge 2d - k - 1$, $s = v_{12} + v_{13} \ge 2d - k - 1 + v_{12}$, hence the bound
$v_{12} \le s + k + 1 - 2d$. □
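
The (in)equalities of Lemma 2 can be sanity-checked mechanically for small $d$ by enumerating every nonnegative solution of the linear system and testing (1) to (6) on each. The sketch below does this; the function names are hypothetical, and $k$ is computed as the number of vertices with a nonempty colour, which equals $n - 3 - v_\emptyset$.

```python
from itertools import product


def solutions(d):
    """All nonnegative (v1, v2, v3, v12, v13, v23) satisfying the constraints."""
    for v12, v13, v23 in product(range(d + 1), repeat=3):
        v1 = d - v12 - v13
        v2 = d - 1 - v12 - v23
        v3 = d - 1 - v13 - v23
        if min(v1, v2, v3) >= 0:
            yield v1, v2, v3, v12, v13, v23


def lemma2_holds(d):
    for v1, v2, v3, v12, v13, v23 in solutions(d):
        k = v1 + v2 + v3 + v12 + v13 + v23   # vertices with a nonempty colour
        s = v12 + v13
        ok = (v12 + v13 + v23 == 3 * d - 2 - k                      # (1)
              and v2 == k - 2 * d + v13 + 1                         # (2)
              and v1 + v2 == k - d - v12 + 1                        # (3)
              and -(-(3 * d - 2) // 2) <= k <= 3 * d - 2            # (4)
              and max(0, 4 * d - 2 * k - 2) <= s <= min(d, 3 * d - 2 - k)   # (5)
              and max(0, 2 * d - k - 1) <= v12 <= min(s, k - 2 * d + s + 1))  # (6)
        if not ok:
            return False
    return True
```

The enumeration ignores $n$, which enters only through $v_\emptyset \ge 0$ and does not affect the six relations.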

How can we bound the number of $(n, d)$-stripped graphs? Looking at the linear
system above it is clear that once we choose values for $k$, $s$ and $v_{12}$ within the bounds of
Lemma 2 all other $v$'s are uniquely determined, as $v_{13} = s - v_{12}$, $v_{23} = 3d - 2 - k - s$,
and the remaining values can always be computed (the system has maximum rank).
Thus, the number of vertices to be assigned a certain colour is now fixed: we just have
to choose which vertices will receive a certain colour. This can be done choosing first
$k$ vertices out of $n - 3$ (that is, the set of vertices with degree smaller than $d$); then
choosing the $3d - 2 - k$ vertices out of $k$ that will have degree $d - 2$; among the latter
we must first choose the $s$ vertices that belong to $V_{12} \cup V_{13}$, and out of these the $v_{12}$
vertices of $V_{12}$; then, among the $k - (3d - 2 - k) = 2k - 3d + 2$ vertices of degree $d - 1$
we must choose the $k - d - v_{12} + 1$ vertices in $V_1 \cup V_2$, and finally out of these the $d - s$
vertices of $V_1$. Once also this choice is fixed, the bound of McKay and Wormald tells
us that the number of graphs with the sequence of degrees given by the choices above
is at most $\gamma_2 M(n - 3, d')$. All in all, we obtain the following horrendous-looking triple
summation:



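\[ \sum_{k=\lceil (3d-2)/2 \rceil}^{3d-2} \; \sum_{s=\max\{0,\,4d-2k-2\}}^{\min\{d,\,3d-2-k\}} \; \sum_{v_{12}=\max\{0,\,2d-k-1\}}^{\min\{s,\,k-2d+s+1\}} \binom{n-3}{k} \binom{k}{3d-2-k} \binom{3d-2-k}{s} \binom{s}{v_{12}} \binom{2k-3d+2}{k-d-v_{12}+1} \binom{k-d-v_{12}+1}{d-s} \, \gamma_2 M(n-3, d') \]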



However, things are not as bad as they may seem: the last factor is independent of all
summation indices, and the first three binomials are independent of $v_{12}$, so they can be
moved out accordingly. Finally, applying trinomial revision^ to the last two binomials
we remove a dependence on $v_{12}$, getting to



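\[ \gamma_2 M(n-3, d') \sum_{k=\lceil (3d-2)/2 \rceil}^{3d-2} \; \sum_{s=\max\{0,\,4d-2k-2\}}^{\min\{d,\,3d-2-k\}} \binom{n-3}{k} \binom{k}{3d-2-k} \binom{3d-2-k}{s} \binom{2k-3d+2}{d-s} \sum_{v_{12}=\max\{0,\,2d-k-1\}}^{\min\{s,\,k-2d+s+1\}} \binom{s}{v_{12}} \binom{2k-4d+s+2}{k-2d+s-v_{12}+1} \]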



Since we are interested in an upper bound, we can extend the last summation to
$0 \le v_{12} \le k - 2d + s + 1$ and use Vandermonde convolution^. The resulting term is a
central binomial coefficient of upper index $2k - 4d + 2s + 2$, and can be bounded with
$2^{2k-4d+2s+2}$; the part independent of $k$ and $s$ can be moved out, getting to



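\[ \gamma_2 \, 4^{1-2d} M(n-3, d') \sum_{k=\lceil (3d-2)/2 \rceil}^{3d-2} \; \sum_{s=\max\{0,\,4d-2k-2\}}^{\min\{d,\,3d-2-k\}} \binom{n-3}{k} \binom{k}{3d-2-k} \binom{3d-2-k}{s} \binom{2k-3d+2}{d-s} \, 4^{k+s}. \]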

^ The trinomial revision theorem states that $\binom{n}{m}\binom{m}{k} = \binom{n}{k}\binom{n-k}{m-k}$.

^ Vandermonde convolution: $\sum_{k=0}^{n} \binom{r}{k}\binom{s}{n-k} = \binom{r+s}{n}$; ibid.

Figure 3. A look at the behaviour of $T_n(s, k)$ for $n = 100$ and $d = \lfloor n/\log n \rfloor$.

There is not much more we can do about the summation term. The summation indices
$k$ and $s$ appear almost everywhere, so we take a different approach: since the range
of summation is extremely small (see Fig. 3) when compared to the summands, we
can try to find an upper bound for the latter. To this aim, we study the behaviour of
finite differences in $k$ and $s$ over the range of summation. This is a standard technique
used when binomials are involved, as the sign of finite differences usually depends in a
simple way on a low-degree polynomial. Indeed, if we let



\[ T_n(s, k) = \binom{n-3}{k} \binom{k}{3d-2-k} \binom{3d-2-k}{s} \binom{2k-3d+2}{d-s} \, 4^{k+s} \]



it is immediate to discover that

\[ T_n(s, k + 1) \ge T_n(s, k) \iff K_n(s, k) \ge 0 \quad (7) \]
\[ T_n(s + 1, k) \ge T_n(s, k) \iff S_n(s, k) \ge 0, \quad (8) \]

where

\[ K_n(s, k) = (4d + 6 - 4n)k - s^2 + (5 + 8d - 4n)s + 12nd - 8n - 8d + 12 - 16d^2 \]
\[ S_n(s, k) = (2s - 2 - 4d)k + 3s^2 + (-12d + 4)s + 12d^2 - 4d - 3. \]
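
Conditions (7) and (8) can be checked by exact integer arithmetic for small parameters. The sketch below is an illustration under stated assumptions: it takes $T_n$ to be the product of the four binomials and $4^{k+s}$, takes $K_n$ and $S_n$ to be the polynomials written out in the code (constant terms $-8d + 12$ and $-3$ respectively), and treats $T_n$ as zero whenever a binomial index is negative. The function names are hypothetical.

```python
from math import comb


def T(n, d, s, k):
    pairs = [(n - 3, k), (k, 3 * d - 2 - k), (3 * d - 2 - k, s),
             (2 * k - 3 * d + 2, d - s)]
    if any(top < 0 or bottom < 0 for top, bottom in pairs):
        return 0  # outside the range where the binomials count anything
    value = 4 ** (k + s)
    for top, bottom in pairs:
        value *= comb(top, bottom)
    return value


def K(n, d, s, k):
    return ((4 * d + 6 - 4 * n) * k - s * s + (5 + 8 * d - 4 * n) * s
            + 12 * n * d - 8 * n - 8 * d + 12 - 16 * d * d)


def S(n, d, s, k):
    return ((2 * s - 2 - 4 * d) * k + 3 * s * s + (4 - 12 * d) * s
            + 12 * d * d - 4 * d - 3)


def conditions_hold(n, d):
    """Check (7) and (8) at every point where T_n is positive."""
    for k in range(3 * d - 1):
        for s in range(d + 1):
            if T(n, d, s, k) == 0:
                continue
            if (T(n, d, s, k + 1) >= T(n, d, s, k)) != (K(n, d, s, k) >= 0):
                return False
            if (T(n, d, s + 1, k) >= T(n, d, s, k)) != (S(n, d, s, k) >= 0):
                return False
    return True
```

The check works because the ratio of consecutive values of $T_n$ is a quotient of two products of linear factors, and $K_n$ and $S_n$ are the differences between numerator and denominator.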



Since both polynomials are linear in $k$ (with ultimately negative coefficient), we can
make conditions (7) and (8) explicit, obtaining two rational functions $z_K(s)$ and $z_S(s)$
such that the inequalities

\[ k \le z_K(s) = -\frac{1}{2} \, \frac{s^2 + (4n - 8d - 5)s + 8n - 12nd - 12 + 8d + 16d^2}{2n - 2d - 3} \]
\[ k \le z_S(s) = \frac{1}{2} \, \frac{3s^2 + (4 - 12d)s + 12d^2 - 3 - 4d}{2d + 1 - s} \]

are equivalent to (7) and (8), respectively.

Armed with the knowledge above, we now try to answer the following question:
which conditions must a pair of integers $(s, k)$ satisfy to be a local maximum of $T_n$?
Clearly the strict version of both condition (7) and condition (8) must be false at $(s, k)$,
for the "next" integers on the plane must feature a smaller or equal value of $T_n$; similarly,
condition (7) must be true at $(s, k - 1)$ and condition (8) must be true at $(s - 1, k)$. All
in all, we obtain the following set of constraints:

\[ k \ge z_K(s) \qquad k \ge z_S(s) \]
\[ k \le z_K(s) + 1 \qquad k \le z_S(s - 1) \]

The situation is depicted in Fig. 4 for $n$ = 10^ and $d = \lfloor n/\log n \rfloor$. The thicker curves
represent the constraints involving $z_K$, and the thinner ones the constraints involving
$z_S$. The region satisfying the constraints is the lozenge formed by the four curves (an
easy check on the values of $z_K$, $z_S$ and their derivatives on the range of summation
shows that indeed this is always the case). The marked point at the intersection of $z_S$
and $z_K$ is the only common zero of $K_n$ and $S_n$ in the range of summation, and can be
easily computed with elementary techniques. Its coordinates are

Figure 4. The constraints on the local maxima of $T_n$.








k = 3d- 




n 




( 9 ) 



Our goal now is to show that knowing $\bar{k}$ and $\bar{s}$ with the precision shown above is sufficient
to know with the same precision the location of the global maximum of $T_n$ over
the pairs of integers in the summation range. In other words, we just have to show
that all integral points in the lozenge are not too far from $(\bar{s}, \bar{k})$. To this purpose, it is
sufficient to give a rough estimate of the size of a rectangle containing the lozenge, for
instance the rectangle defined by the upper and lower intersection points, which happen
to be also the leftmost and rightmost, respectively. Theoretically it is possible to compute
these points exactly, but unfortunately they are the unmanageable roots of two cubic
equations. However, standard algebraic manipulation shows that



zs\s-Y-, 
ZK\s + Y^ 



d^ 
d^ 

zs[s + Y-^ 



1 d^ 

- 1 = t(K -2)^ -r + o 

2 

\ 1 

1 ) — -(y + 2)^ r o 



where $r = \bar{s} - 2d^2/n = O(d^3/n^2)$. In other words, we can choose a fixed $\gamma$ so
that ultimately at distance $\gamma d^3/n^2$ to the left of $\bar{s}$ the thinner curves are both over the
thicker ones, and conversely at distance $\gamma d^3/n^2$ to the right. This shows that the width
of the lozenge is $O(d^3/n^2)$. Finally, it is easy to check that $z_K(\bar{s} - \gamma d^3/n^2) - z_K(\bar{s} + \gamma d^3/n^2) = O(d^3/n^2)$,
so the lozenge is included in a rectangle whose sides are both
$O(d^3/n^2)$. We conclude that the global maximum on integer pairs is attained at a point
$(s, k)$ satisfying




n 







that is, modulo error terms $O(d^3/n^2)$, the same as (9). As the reader will see, this will
be sufficient for our purposes.

We have thus finally reached our goal: our bound on the number of $(n, d)$-stripped
graphs becomes

\[ \gamma_2 M(n - 3, d') \, d \lfloor (3d - 2)/2 \rfloor \, T_n(\bar{s}, \bar{k}), \]

and the proof that

\[ \frac{n^3 \gamma_2 M(n - 3, d') \, d \lfloor (3d - 2)/2 \rfloor \, T_n(\bar{s}, \bar{k})}{(\gamma_1 / n^{\varepsilon}) M(n, d)} \to 0 \]

is now amenable to standard asymptotic techniques. In the next section we provide a
full (and rather tedious) proof.



7 A Proof of Theorem 3 — Part II 



We start with a few obvious considerations: since we have exponentials around, we need
to estimate the natural logarithm of our expression. In doing so, we plan in advance
to consider only summands that are of order at least $d^3/n^2$: in particular, the factor
$\gamma_2 n^{3+\varepsilon} d \lfloor (3d - 2)/2 \rfloor / \gamma_1$ plays no role. We remark the following asymptotic relations
between the principal quantities we will have to manage:




d ^ d^ = o{n) d — d^ = — O 

n 



d^ _ ( \ 



d = Q 
We start by computing the natural logarithm of $M(n - 3, d')/M(n, d)$:



In 



M(n-3,d^) 



n — 3 



[ln(27T) + ln{n -3)-{n-2) \n{n - 4)+ 



M{n, d) 2 

“h {d^ -\- \)\wd^ -\- (ji — 3 — d^^ ln(/i — 4 — 

“h — [ln(27r) -\-\wn — (/i -h 1) ln(/i — (d l^hid (n — d^ ln(/i — 1 — ^/)] 



In- 



3 n 
nm — 



•4 , , (n-4)\n-l) , 3J - 2 , , 



. .M n-A-d' 3J-2, ^ ^ 

-\-{n — d) In H — ln(w — 4 — d ) 



+ 2 



n — I — d n — 3 
+ ln(w -3)-{n-2) \n{n - 4) + (J' + 1) In + (w - 3 - J') ln(w - 4 - J')] 



Since we are going to expand asymptotically all logarithms, we notice that 



{n-A)^{n - 1) 
{n-A- d^f 



7/2 



7/3 



. A 



— 1+3 h 6 — ^ + 10 — -z — h O I — Y 




so we obtain



In 1^1 
-\-{d + 1) In ( 1 — 



— 7z In ^ 1 
3d -2 



a 



7/3 



/ a' a' a' 

+ Ini 1 + 3 h 6 — ^ + 10 — -z — h O 



7/4 



1 7 \ ' n ' V tz4 



V d{n - 3) 



3d -2 
n — 3 



In d^ -\~ (ji — d^ In I 1 + 



d — d^ 



n — I — d 



+ 



3d -2 

H ln(7z — 4 — d) 

n — 3 



+ - [ln(27T) + ln(7z -3)-{n-2) \n{n - 4)+ 



+(J' + 1) In + (tz - 3 - J') ln(zz - 4 - J')] • 



Note that since $\ln(1 + x) = x - x^2/2 + x^3/3 + O(x^4)$ for $x \to 0$, if $g = o(f)$ then

\[ \ln(f + g) = \ln f + \frac{g}{f} - \frac{g^2}{2f^2} + \frac{g^3}{3f^3} + O\left(\frac{g^4}{f^4}\right) \]
\[ (f + g)\ln(f + g) = (f + g)\ln f + g + \frac{g^2}{2f} - \frac{g^3}{6f^2} + O\left(\frac{g^4}{f^3}\right). \]

Applying the expansions above and systematically killing all terms that are $o(d^3/n^2)$
we obtain



In 



M(n-3,d^) 
M(n, d) 



3n 









9d 



-/2 



7/3 



_n — \ 
3d -2 



+ 3 h 6 — 7T + 10 — -z — — — ;; — 18 — ^ + 9 — -z — 



n — 3 
3d -2 



lnJ' + 



2 

(n-d)(d -d^ -3) 3d -2 



(J + l)(3J-2) 
d(n - 3) 



+ 



+ ■ 



-\-(n-d^){- 



4 + 
n 

4 + 
n 

3 d^ \d^ 



n — I — d 

1 (4 + 

2 

1 ( 4 + 70 ^ 



+ 



■ Inn + 



3 

+ 2 



n — 3 

—n \nn + d^ In + (w — + 



1 (4 + 7')^ 



(d^ 



+ ol — 

\n^ 



— —3d + — h — — T + 3d In — \~ o{ — 

2 n 2 n 



where the last passage is just algebra, once one notes that it is possible to replace $d'$
with $d$ inside logarithms (the resulting error is within our bound).

We now approach the rest of the limit. We want to estimate the behaviour of



Tnis.k) 



/n-3W k \(M-2-k\nk-M + 2\i^, 

\ k )\2d -2- k)\ s A d-s ) 

4^+^'(n-3)! 

~ s\{n - 3 - ky.Od -2-k-s)\{d- s)\{2k -4d + 2 + s)\ 



using the following asymptotic identity derived from Stirling approximation: 

ln[(/ + g)!] = (/ + ^) In / - / + + Oilnif + g)) , 



which is true when g = o{f), and always keeping in mind that 



n 



3d — k — s — — “h O 
n 



i 


^d^\ 


d^ j 


(d^\ 


+ o{ 


- 


s — 2 — “h 0 1 


- 


d^ \ 


vnV 


n ’ 

- _ d^ 


WJ 


1?) 




2k — 4^/ “h 5” = 2d — 4 — 
n 


+ e( 
Applying systematically the identities above we obtain:

In Tfi {k, s) = iji -\-s — 2d -\-\) ln4 -\- {n — 2)\nn — n — {n — k — 2)\nn -\- n-\- 
ik + 3)2 ik + 3)2 



2n 6«2 

o2 



(3d — 2 — k — 5')ln — H -h 

n n 



— (d — s^Xwd -\- d — — — — (2k — Ad -h 2 -h i?) ln(2^/) -\~ 2d -\~ 
2d 



1 



■ d^ ^fd^ 

—4 ^ 

n \n^ 



= (k A- s — 2d) ln4 + k Inn 



fd^ 

(^ + 3)2 



\n^ 



2n 



(k + 3)^ 
6n^ 



^In h3 h 

n n 



d^ ~s^ 

-(3d -k-~s)\n (d-~s)\nd- — -(2k-Ad + ~s) \n(2d) + 

n 2d 

o. o. 0.1 d 3d^ 3d^ fd^' 

-\~ 3d — 4 A~ o\ — — ) — 3d — 3d In — — — — — — oi — — 

2n^ V 



n 



2 n 






where again the last passage is just algebra (all logarithms cancel out happily). Finally, 
we put together everything, getting to 



3 d^ \d^ 



d 3 d^ 3 d^ 



3d A 1 T A~ 3d In — h 3d — 3d In ^ ^ I — r 



2 n 2n^ 



n 2 n 2n^ 



d^ 



— oo, 



as required. 
Abstract. Rumor mongering (also known as gossip) is an epidemiological
protocol that implements broadcasting with a reliability that can be
very high. Rumor mongering is attractive because it is generic, scalable,
adapts well to failures and recoveries, and has a reliability that gracefully
degrades with the number of failures in a run. However, rumor mongering
uses random selection for communications. We study the impact of using
random selection in this paper. We present a protocol that superficially
resembles rumor mongering but is deterministic. We show that this new
protocol has most of the same attractions as rumor mongering. The one
remaining attraction that rumor mongering has over the deterministic
protocol, namely graceful degradation, comes at a high cost in terms
of the number of messages sent. We compare the two approaches both at
an abstract level and in terms of how they perform in an Ethernet and
in a small wide area network of Ethernets.



1 Introduction 

Consider the problem of designing a protocol that broadcasts messages to all
of the processors in a network. One can be interested in different metrics of a
broadcast protocol, such as the number of messages it generates, the time needed
for the broadcast to complete, or the reliability of the protocol (where reliability
is the probability that either all or no nonfaulty processors deliver a broadcast
message and that all nonfaulty processors deliver the message if the sender is
nonfaulty). Having fixed a set of metrics, one then chooses an abstraction for
the network. There are two approaches that have been used in choosing such an
abstraction for broadcast protocols.

One approach is to build upon the specific physical properties of the network.
For example, there are several broadcast protocols that attain very high reliability
for Ethernet networks [17] or redundant Ethernets [2, 5, 7]. Such protocols
can be very efficient in terms of the chosen metrics because one can leverage

^ This research was supported in part by DARPA grant N66001-98-8911 and NSF award
CCR-9803743. Most of this work was done when Dr. Lin was a graduate student at
UT Austin.
off of the particulars of the network. On the other hand, such protocols are not
very portable, since they depend so strongly on the physical properties of the
network.

The other approach is to assume a generic network. With this approach, 
one chooses a set of basic network communication primitives such as sending 
and receiving a message. If reliability is a concern, then one can adopt a failure 
model that is generic enough to apply to many different physical networks. There 
are many examples of reliable broadcast protocols for such generic networks [14]. 
We consider in this paper broadcast protocols for generic networks. 

Unfortunately, many reliable broadcast protocols for generic networks do not 
scale well to large numbers of processors [4]. One family of protocols for generic 
networks that are designed to scale are called epidemiological algorithms or gossip
protocols.^ Gossip protocols are probabilistic in nature: a processor randomly chooses
the partner processors with which it communicates. They are scalable
because each processor sends only a fixed number of messages, independent 
of the number of processors in the network. In addition, a processor does not 
wait for acknowledgments nor does it take some recovery action should an ac- 
knowledgment not arrive. They achieve fault-tolerance against intermittent link 
failures and processor crashes because a processor receives copies of a message 
from different processors. No processor has a specific role to play, and so a failed 
processor will not prevent other processors from continuing sending messages. 
Hence, there is no need for failure detection or specific recovery actions. 

A drawback of gossip protocols is the number of messages that they send. 
Indeed, one class of gossip protocols (called anti-entropy protocols [9]) send an 
unbounded number of messages in nonterminating runs. Such protocols seem to 
be the only practical way that one can implement a gossip protocol that attains 
a high reliability in an environment in which links can fail for long periods 
of time [26]. Hence, when gossiping in a large wide-area network, anti-entropy 
protocols are often used to ensure high reliability. However, for applications that 
require timely delivery, the notion of reliability provided by anti-entropy may 
not be strong enough since it is based on the premise of eventual delivery of 
messages. 

Another class of gossip protocols is called rumor mongering [9]. Unlike anti- 
entropy, these protocols terminate and so the number of messages that are sent 
is bounded. The reliability may not be as high as anti-entropy, but one can trade 
off the number of messages sent with reliability. Rumor mongering by itself is 
not appropriate for networks that can partition with the prolonged failure of 
a few links, and so is best applied to small wide-area networks and local area 
networks. 

Consider the undirected clique that has a node for each processor, and let 
one processor p broadcast a value using rumor mongering. Assume that there are 
no failures. As the broadcast takes place, processors choose partners at random. 

^ The name gossip has been given to different protocols. For example, some authors 
use the term gossip to mean all-to-all communications, and what we describe here 
would be called random broadcasting in the whispering mode [21]. 
For each processor q and a partner r it chooses, mark the edge from q to r. At
the end of the broadcast, remove the edges that are not marked. The resulting 
graph is the communications graph for that broadcast. This graph should be 
connected, since otherwise the broadcast did not reach all processors. Further, 
the node and link connectivities of the communications graph give a measure of 
how well this broadcast would have withstood a set of failures. For example, if the 
communications graph is a tree, and if any processor represented by an internal 
node had crashed before the initiation of the broadcast, then the broadcast would 
not have reached all non-crashed processors. 
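
The construction of the communications graph can be illustrated with a small simulation. This is a sketch under assumptions that the text does not fix: every processor, on first receiving the message, forwards it once to `fanout` distinct partners chosen uniformly at random, there are no failures, and the names `rumor_mongering`, `fanout` and `seed` are hypothetical.

```python
import random
from collections import deque


def rumor_mongering(n, fanout, source=0, seed=0):
    """Return (delivered, marked) for one failure-free broadcast on n processors."""
    rng = random.Random(seed)
    delivered = {source}
    marked = set()            # marked edges (q, r) of the communications graph
    queue = deque([source])
    while queue:
        q = queue.popleft()
        others = [r for r in range(n) if r != q]
        for r in rng.sample(others, fanout):   # q chooses its partners at random
            marked.add((q, r))
            if r not in delivered:             # r hears the rumor for the first time
                delivered.add(r)
                queue.append(r)
    return delivered, marked


delivered, marked = rumor_mongering(n=8, fanout=7)
```

With `fanout = n - 1` every processor is certainly reached; with a small fanout some processors may be missed, which corresponds to a disconnected communications graph.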

In this paper, we compare rumor mongering with a deterministic version of
rumor mongering. This deterministic protocol superimposes a communication
graph that has a minimal number of links given a desired connectivity. The
connectivity is chosen to attain a desired reliability, and because the graph has a
minimal number of links, the broadcast sends a small number of messages. This
comparison allows us to ask the question: what value does randomization give to
rumor mongering? We show that the deterministic version does compare favorably with
traditional rumor mongering in all but one metric, namely graceful degradation.

We call the communications graphs that we impose Harary graphs because 
the construction we use comes from a paper by Frank Harary [15]. The deter- 
ministic protocol that we compare with rumor mongering is a simple flooding 
protocol over a Harary graph. 

The rest of the paper proceeds as follows. We first discuss some related re-
search. We then describe gossip protocols and their properties. Next, we describe
Harary graphs and show that some graphs yield higher reliabilities than others
given a fixed connectivity. We then compare Harary graph-based flooding with
gossip protocols both at an abstract level and using a simple simulation.

Due to space restrictions, no theorems are proven in this paper. Interested
readers can find all of the proofs in [20].



2 Related Work 



Superimposing a communications graph is a well-known technique for imple-
menting broadcast protocols. Let an undirected graph G = (V, E) represent
such a superimposed graph, where each node in V is a processor and each edge
in E means that the two nodes incident on the edge can directly send each other
messages at the transport level. Two nodes that have an edge between them are
called neighbors. A simple broadcast protocol has a processor initiate the broad-
cast of m by sending m to all of its neighbors. Similarly, a node that receives
m for the first time sends m to all of its neighbors except for the one which
forwarded it m. This technique is commonly called flooding [1]. Depending on
the superimposed graph structure, a node may be sent more than one copy of m.
We call the number of messages sent in the reliable broadcast of a single m the
message overhead of the broadcast protocol. For flooding, the message overhead
is between one and two times the number of edges in the superimposed graph.
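The flooding rule just described can be illustrated with a small Python sketch. The function name flood, the adjacency-dictionary representation, the crashed parameter, and the first-in-first-out delivery order are assumptions made for this illustration; they are not part of the protocol definition.

```python
from collections import deque

def flood(adj, source, crashed=frozenset()):
    """Flood a message from source over the graph adj (node -> neighbor list).

    Returns (set of nodes reached, number of messages sent).
    """
    reached = {source}
    # Each queue entry is one message in flight: (sender, receiver).
    queue = deque((source, q) for q in adj[source])
    sent = len(adj[source])
    while queue:
        sender, p = queue.popleft()
        if p in crashed or p in reached:
            continue  # crashed nodes drop m; duplicates are not forwarded again
        reached.add(p)
        for q in adj[p]:
            if q != sender:  # forward to every neighbor except the forwarder
                queue.append((p, q))
                sent += 1
    return reached, sent
```

On a six-node circuit this sketch sends 7 messages for 6 edges, and on a four-node path it sends 3 messages, one per edge; crashing an internal node of the path keeps the message from reaching the nodes behind it.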
The most common graph that is superimposed is a spanning tree (for ex-
ample, [10, 24]). Spanning trees are attractive when one wishes to minimize the
number of messages: in failure-free runs, each processor receives exactly one mes-
sage per broadcast and so the message overhead is |V| - 1. Their drawback is
that when failures occur, a new spanning tree needs to be computed and dissem-
inated to the surviving processors. This is because a tree can be disconnected
by the removal of any internal (i.e., nonleaf) node or any link.

If a graph more richly connected than a tree is superimposed, then not all sets 
of link and internal node failures will disconnect the graph. Hence, if a detected 
and persistent failure occurs, any reconfiguration — that is, the computation of a 
new superimposing graph — can be done while the original superimposed graph 
is still used to flood messages. Doing so lessens the impact of the failure.

One example of the use of a graph more richly connected than a tree is 
discussed in [11]. In this work, they show how a hypercube graph can be used 
instead of a tree to disseminate information for purposes of garbage collection. 
It turns out that a hypercube is a Harary graph that is three-connected. 

A more theoretical example of the use of a more richly connected graph 
than a tree is given by Liestman [18]. The problem being addressed in this work 
is, in some ways, similar to the problem we address. Like our work, they are 
interested in fault-tolerant broadcasting. And, like us, they wish to have a low 
message overhead. The models, however, are very different. They consider only 
link failures while we consider both link and node failures. They assume that a 
fixed unit of time elapses between a message being sent and it being delivered. 
They are concerned with attaining a minimum broadcast delivery time while 
we are not. And, the graphs that they superimpose are much more complex to
generate than Harary graphs. However, it turns out that some of the
graphs they construct are also Harary graphs.

Similarly, the previous work discussed in the survey paper [21] on fault- 
tolerant broadcasting and gossiping is, on the surface, similar to the work re- 
ported here. The underlying models and the goals, however, are different from 
ours. Our work is about what kind of communication graphs to superimpose such
that flooding on such a graph masks a certain number of failures while sending
the minimum number of messages. We also consider how the reliability degrades
with respect to the graph structure when the number of failures exceeds what
can be masked. That earlier work, on the other hand, assumes that messages are
transmitted or exchanged through a series of calls over the graph. The main goal is to
compute the minimum time or its upper bound to complete a broadcast in the 
presence of a fixed number of failures, given different communication modes and 
failure models. The minimum number of calls that have to take place in order 
for a broadcast message to reach all processes is not always a metric of interest. 

The utility of the graphs described by Harary in [15] for the purposes of 
reducing the probability of disconnection was originally examined in [25]. This 
work, however, was concerned with rules for laying out wide-area networks to 
minimize the cost of installing lines while still maintaining a desired connectivity. 
It is otherwise unrelated to the work described in this paper. 
3 Gossip Protocols

Gossip protocols, which were first developed for replicated database consistency
management in the Xerox Corporate Internet [9], have been built to implement
not only reliable multicast [4, 12] but also failure detection [23] and garbage col-
lection [13]. Nearly all gossip protocols have been developed assuming the failure
model of processor crashes and message link failures that lead to dropped mes-
sages. Coordinated failures, such as the failure of a broadcast bus-based local
area network, are not usually considered. Such failures can only be masked by having
redundant local area networks that fail independently (see, for example, [2, 5, 7]).
Nor are gossip protocols usually discussed in the context of synchronous versus
asynchronous or timed asynchronous models [8, 14]. Like earlier work in gossip proto-
cols, we do not consider coordinated failures or the question of how synchronous
the environment must be to ensure that these protocols terminate in all runs.

Gossip protocols have the following three features: 

1. Scalability Their performance does not rapidly degrade as the number of pro-
cessors grows. Each processor sends a fixed number of messages that is indepen-
dent of the number of processors. And, each processor runs a simple algorithm
that is based on slowly-changing information. In general, a processor needs to
know the identity of the other processors on its (local-area or small wide-area)
network and a few constants. Hence, as long as the stability of the physical net-
work does not degrade as the number of processors grows, gossip is scalable.

2. Adaptability It is not hard to add or remove processors in a network. In both 
cases, it can be done using the gossip protocol itself. 

3. Graceful Degradation For many reliable broadcast protocols, there is a value
f such that if there are no more than f failures (as defined by a failure model)
then the protocol will function correctly. If there are f + 1 failures, however,
then the protocol may not function correctly. The reliability of the protocol is
then equal to the probability that no more than f failures occur. Computing
this probability, however, may be hard to do and the computation may be based
on values that are hard to measure. Hence, it is advantageous to have a protocol
whose probability of functioning correctly does not drop rapidly as the number
of failures increases past f. Such a protocol is said to degrade gracefully. One
can build gossip protocols whose reliability is rather insensitive to f.

There are many variations of gossip protocols within the two approaches 
mentioned in Section 1. Below is one variation of rumor mongering. This is the 
protocol that is used in [16] (specifically, F = 1 and the number of hops a gossip 
message can travel is fixed) and is called blind counter rumor mongering in [9]: 

initiate broadcast of m:

    send m to B_initial neighbors

when (p receives a message m from q)

    if (p has received m no more than F times)
        send m to B randomly chosen neighbors that
        p does not know have already seen m;
A gossip protocol that runs over a local-area network or a small wide-area
network effectively assumes that the communications graph is a clique. There-
fore, the neighbors of a processor would be the other processors in the network.
A processor knows that another processor q has already received m if it has pre-
viously received m from q. Processors can more accurately determine whether a
processor has already received m by having m carry a set of processor identifiers.
A processor adds its own identifier to the set of identifiers before forwarding m,
and the union of the sets it has received on copies of m identifies the processors
that it knows have already received m. Henceforth in this paper, the gossip pro-
tocol we discuss is this specific version of rumor mongering that uses this more
accurate method.
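To make the protocol concrete, the following Python sketch simulates one broadcast of this version of rumor mongering on an n-node clique. It is only an illustration under stated assumptions: the function name gossip_broadcast, the first-in-first-out delivery of messages, B_initial = B, a non-crashed initiator, and crashes that occur before the broadcast starts are all choices made for the sketch.

```python
import random
from collections import deque

def gossip_broadcast(n, B, F, crashed=frozenset(), source=0, rng=None):
    """Simulate one rumor-mongering broadcast on an n-node clique.

    Returns (all live processors reached?, number of messages sent).
    """
    rng = rng or random.Random()
    known = [{p} for p in range(n)]  # processors that p knows have seen m
    received = [0] * n               # copies of m delivered to each processor
    queue = deque()                  # messages in flight: (receiver, id set)
    sent = 0

    def forward(p, ids):
        nonlocal sent
        carried = frozenset(ids | {p})  # p adds its own identifier to m
        candidates = [q for q in range(n) if q not in known[p]]
        for q in rng.sample(candidates, min(B, len(candidates))):
            queue.append((q, carried))
            sent += 1

    forward(source, frozenset())        # initiate the broadcast
    while queue:
        p, ids = queue.popleft()
        if p in crashed:
            continue                    # a crashed processor drops m
        received[p] += 1
        known[p] |= ids                 # union of the identifier sets seen so far
        if received[p] <= F:            # forward on the first F receipts only
            forward(p, ids)
    live = set(range(n)) - set(crashed)
    reached = {p for p in live if p == source or received[p] > 0}
    return reached == live, sent
```

Because every copy of m carries the initiator's identifier, no processor sends m back to the initiator, so in this sketch the number of messages never exceeds BF(n - f) when f processors are crashed.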

Since a processor selects its partners randomly, there is some chance that a
message may not reach all processors even when there are no failures. However,
in a clique, this probability is small [22]. Therefore, the reliability of gossip
protocols is considered to be high. This is not the case when the connectivity of 
the network is not uniform, though. It has been shown that when the network 
has a hierarchical structure, a gossip message can fail to spread outside a group 
of processors [19]. 

It is often very difficult to obtain an analytical expression to describe the 
behavior of a gossip protocol. Often, the best one can get are equations that 
describe asymptotic behavior for simple network topologies. For more complex 
topologies or protocols, one almost always resorts to simulations. Our simulations
show that the reliability does not drop too much for small values of f because
a processor can receive messages from many processors. Also, the reliability of
gossip rapidly increases with B · F. In general, for a given value of B · F, higher
reliability is obtained by having a larger F (and therefore a smaller B). This is
because a processor will have a more accurate idea of which processors already
have m the later it forwards m.

Our simulation results also show that, as expected, the average number of
messages sent per broadcast is bounded by BF(n - f). For a given value of
B · F, fewer messages are sent for larger F. In this case, some processors learn
that most other processors have already received the broadcast by the time they
receive m F - 1 times, and so do not forward m to B processors.

The left hand graph of Figure 1 illustrates how gracefully the measured reli-
ability of gossip degrades as a function of f (the right hand graph is discussed in
Section 4.3). This figure was generated using simulation with 10,000 broadcasts
done for each value of f and having the f processors crash at the beginning of
each run. The number of processors n is 32, B = 4, and F = 3. The measured
reliability is 0.9999 for f = 3 crashed processors and is 0.869 for f = 16 crashed
processors. Notice that the measured reliability is not strictly decreasing with
f; when f is close to n, only a few processors need to receive the message for
the broadcast to be successful. Indeed, when f = n - 1, gossip trivially has a
reliability of 1.
Fig. 1. Reliability of gossip and of Harary graph flooding.



4 Harary Graphs 

In this section, we discuss the approach of imposing a Harary graph of certain 
connectivity on a network of processors and having each processor flood over 
that graph. The graph will be connected enough to ensure that the reliability of 
the flooding protocol will be acceptably large. 



4.1 Properties of Harary Graphs 

A Harary graph is an n-node graph that satisfies the following three properties:

1. It is t-node connected. The removal of any subset of t - 1 nodes will not
disconnect the graph, but there are subsets of t nodes whose removal dis-
connects the graph.

2. It is t-link connected. The removal of any subset of t - 1 links will not discon-
nect the graph, but there are subsets of t links whose removal disconnects
the graph.

3. It is link minimal. The removal of any link will reduce the link connectivity
(and therefore the node connectivity) of the graph.

Let $H_{n,t}$ denote the set of Harary graphs that contain n nodes and have a
link and a node connectivity of t. For example, $H_{n,1}$ is the set of all n-node
trees, and $H_{n,2}$ is the set of all n-node circuits. Figure 2 shows one graph in
$H_{7,3}$, two graphs in $H_{8,3}$, and one further Harary graph.

Harary gave an algorithm for the construction of a graph in $H_{n,t}$ for any
value of n and t < n [15]. We denote this graph as $H^c_{n,t}$ and call it the canonical
Harary graph. The algorithm is as follows:

$H^c_{n,1}$: The tree with edges $\forall i : 0 \le i < n - 1 : (i, i + 1)$.

$H^c_{n,2}$: The circuit with edges $\forall i : 0 \le i < n : (i, i + 1 \bmod n)$.

$H^c_{n,t}$ for t > 2 and even: First construct $H^c_{n,2}$. Then, for each value of
m : $2 \le m \le t/2$ add edges (i, j) where $|i - j| = m \bmod n$.

$H^c_{n,t}$ for t > 2 and odd: First construct $H^c_{n,t-1}$. Then, connect all pairs (i, j) of
nodes such that $j - i = \lfloor n/2 \rfloor$.

Fig. 2. A graph of $H_{7,3}$, two of $H_{8,3}$, and one further Harary graph.
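The construction can be sketched in Python as follows. The function name canonical_harary and the representation of the graph as a set of (i, j) pairs with i < j are assumptions of this sketch; the odd case joins pairs with j - i equal to the floor of n/2 without wrapping around, which is one reading of the rule above.

```python
def canonical_harary(n, t):
    """Edge set of a canonical Harary graph on nodes 0..n-1, for 1 <= t < n."""
    if not 1 <= t < n:
        raise ValueError("need 1 <= t < n")
    if t == 1:
        # The tree (a path) with edges (i, i + 1).
        return {(i, i + 1) for i in range(n - 1)}
    edges = set()
    # Offset 1 gives the circuit; offsets 2..t//2 add the chords of the even case.
    for m in range(1, t // 2 + 1):
        for i in range(n):
            j = (i + m) % n
            edges.add((min(i, j), max(i, j)))
    if t % 2 == 1:
        # Odd t: join every pair (i, j) with j - i = floor(n / 2).
        half = n // 2
        for i in range(n - half):
            edges.add((i, i + half))
    return edges
```

For n = 22 and t = 4 this yields 44 links, and for n = 7 and t = 3 it yields 11 links with six nodes of degree 3 and one node of degree 4.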

For most values of n and t, there is more than one graph in the set $H_{n,t}$.
All but the third graph in Figure 2 are canonical Harary graphs, while
the third graph in Figure 2 is not (it is the unit cube).

It is not hard to see why Harary graphs are link-minimal among all t-
connected graphs. For any graph G, the node connectivity $\kappa(G)$, link connectivity
$\lambda(G)$, and minimum degree $\delta(G)$ are related:

$$\kappa(G) \le \lambda(G) \le \delta(G) \qquad (1)$$

Harary showed that $\delta(G)$ is bounded by $\lceil 2\ell/n \rceil$, where $\ell$ is the number of
links [15]. Therefore, to have t node or link connectivity, the number of links has
to be at least $\lceil nt/2 \rceil$. If G is a regular graph (that is, all of the nodes have the
same degree), then $\kappa(G) = \lambda(G) = \delta(G) = t$. A regular graph with nt/2 links
is thus link-minimal among all graphs with t node or link connectivity. Harary
graphs are such graphs when t is even or when n is even and t is odd.

When both n and t are odd, Harary graphs are not regular graphs because 
there is no regular graph of odd degree with an odd number of nodes. Rather, 
there are n — 1 nodes of degree t and one node of degree t + 1. The number of 
links is $\lceil nt/2 \rceil$. They are link-minimal because removing any link will result in
at least one node having its node degree reduced from t to t — 1. 



4.2 Overhead and Reliability of Flooding on Harary Graphs 

When n or t is even and assuming no failures, the overhead of flooding over a 
Harary graph is bounded from above by n(t — 1) + 1: the processor that starts 
the broadcast sends t messages and all the other processors send no more than 
t — 1 messages. When n and t are odd one more message can be sent because 
the graph is not regular. 
As long as the communications graph remains connected, a flooding protocol
is guaranteed to be reliable. Hence, we characterize reliability in terms of the
probability of the graph disconnecting. We first consider a failure model that
includes the failure of nodes (that is, processor crashes) and the failure of links
(that is, message channels that can drop messages due to congestion or physical
faults). We then consider only node failures.

For local-area networks and small wide-area networks, one normally assumes
that each link has the same probability $p_\ell$ of failing and each node has the same
probability $p_n$ of failing, and that all failures are independent of each other. We
do so as well.

Since a graph in $H_{n,t}$ is both t-connected and t-link connected, one can compute an
upper bound on the probability of the graph being disconnected: disconnection
requires x node failures and y link failures such that $x + y \ge t$ [3]. The
reliability r is thus bounded from below by the following:

$$r \ge 1 - \sum_{x=0}^{n-2} \; \sum_{y=\max(t-x,0)}^{\ell} \binom{n}{x} p_n^{x} (1-p_n)^{n-x} \binom{\ell}{y} p_\ell^{y} (1-p_\ell)^{\ell-y} \qquad (2)$$

where $\ell$ is the number of links in the graph: $\ell = \lceil nt/2 \rceil$. This formula, however, is
conservative since it assumes that all such failures of nodes and links disconnect
the graph. For example, consider $H^c_{4,2}$ and assume that $p_n = 0.01$ and $p_\ell = 0.001$
(that is, each processor is crashed approximately 14 minutes a day and a link is
faulty approximately 1.4 minutes a day). The above formula computes a lower
bound on the reliability of 0.9993. If we instead examine all of the failures that
disconnect $H^c_{4,2}$ and sum the probabilities of each of these cases happening, then
we obtain an actual reliability of 0.9997.
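The lower bound can be checked numerically with a short Python sketch. It assumes the reading of Equation 2 in which the bound subtracts the probability of x node failures (x between 0 and n - 2) together with y link failures such that x + y is at least t; the function names binom_pmf and reliability_lower_bound are hypothetical.

```python
from math import comb

def binom_pmf(k, m, p):
    """Probability that exactly k of m independent components fail."""
    return comb(m, k) * p**k * (1 - p)**(m - k)

def reliability_lower_bound(n, t, p_n, p_l):
    """Conservative lower bound on flooding reliability for a graph in H(n, t)."""
    links = (n * t + 1) // 2            # ceil(n * t / 2) links
    p_disconnect = 0.0
    for x in range(0, n - 1):           # x = 0 .. n - 2 node failures
        for y in range(max(t - x, 0), links + 1):
            p_disconnect += binom_pmf(x, n, p_n) * binom_pmf(y, links, p_l)
    return 1.0 - p_disconnect
```

With n = 4, t = 2, a node failure probability of 0.01 and a link failure probability of 0.001, the sketch evaluates to roughly 0.9993, in line with the figure quoted above.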

In general, computing the probability of a graph disconnecting given indi- 
vidual node and link failure probabilities is hard [6] , and so using Equation 2 is 
the only practical method we know of for computing the reliability of flooding
on a graph in $H_{n,t}$. But, if one assumes that links do not fail then one can compute a more
accurate value of the reliability. Note that assuming no link failures may not 
be an unreasonable assumption. A vast percentage of the link failures in small 
wide-area networks and local area networks are associated with congestion at ei- 
ther routers or individual processors. Hence, link failures rarely endure and can 
often be masked by using a simple acknowledgment and retransmission scheme. 

Consider the following metric: 

Definition 1. Given a graph $G \in H_{n,t}$, the fragility F(G) of G is the fraction
of subsets of t nodes whose removal disconnects G.

For example, in the left graph of Figure 3, 187 of the 7,315 subsets of 4 nodes
are cutsets of the graph, and so the graph has a fragility of 0.0256. The graph
on the right, also a member of $H_{22,4}$, has a fragility of 22/7,315 = 0.0030.

Since here we consider node failures only, $p_\ell$ is zero. We further assume that
$p_n$ is small. Thus, the probability of G disconnecting can be estimated as the
probability of exactly t nodes failing weighted by F(G), that is,
$F(G) \binom{n}{t} p_n^{t} (1-p_n)^{n-t}$. So,

$$r \le 1 - F(G) \binom{n}{t} p_n^{t} (1-p_n)^{n-t} \qquad (3)$$

This is an upper bound on r because it does not consider disconnections arising
from more than t nodes failing. But, when $p_n$ is small the contribution due to
more than t nodes failing is small.

Fig. 3. Two graphs of $H_{22,4}$.

Assume that each processor is crashed for five minutes a day, and so $p_n =
0.0035$. From Equations 2 and 3 and setting $p_\ell = 0$ we compute $0.999998989 \le
r \le 0.999999974$ for flooding in the left graph of Figure 3. In fact, the true
reliability r computed by enumerating all the disconnecting cutsets and summing
their probabilities of occurring is 0.999999973. Similarly, for the right-hand graph
of Figure 3, we compute $0.999998989 \le r \le 0.999999997$ and r is 0.999999997.
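The upper bounds quoted here can be reproduced with a few lines of Python. The sketch assumes that the disconnection probability is estimated as the fragility times the probability that exactly t of the n nodes fail, and that five minutes a day corresponds to a node failure probability of 5/1440; the function name reliability_upper_bound is hypothetical.

```python
from math import comb

def reliability_upper_bound(n, t, p_n, cutsets):
    """Upper bound on reliability when only node failures are considered.

    cutsets is the number of t-node subsets whose removal disconnects the graph.
    """
    fragility = cutsets / comb(n, t)
    # Probability that exactly t of the n nodes fail, weighted by the fragility.
    return 1.0 - fragility * comb(n, t) * p_n**t * (1 - p_n)**(n - t)

p_n = 5 / 1440  # five minutes of downtime per day
left = reliability_upper_bound(22, 4, p_n, 187)   # left graph of Figure 3
right = reliability_upper_bound(22, 4, p_n, 22)   # right graph of Figure 3
```

Under these assumptions, left and right evaluate to approximately 0.999999974 and 0.999999997 respectively.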

4.3 Graceful Degradation 

Still assuming that $p_\ell = 0$, one can extend the notion of fragility to compute
how gracefully reliability degrades in flooding on a Harary graph.

Definition 2. Given a graph $G \in H_{n,t}$, F(G, f) is the fraction of subsets of f
nodes whose removal disconnects G.

Hence, F(G, t) = F(G) and F(G, f) = 0 for $0 \le f < t$ or $n - 1 \le f \le n$.
Given that some subset of f nodes has failed, the graph remains
connected with probability 1 - F(G, f). Thus, we can use 1 - F(G, f)
as a way to characterize the reliability of flooding on a Harary graph G. This
is much like the way we measured the reliability of gossip as a function of f as
illustrated in Figure 1. There, the reliability was measured after f nodes have
crashed.
The graphs of 1 - F(G, f) for the two graphs of Figure 3 are shown in the
right hand graph of Figure 1. As can be seen, the less fragile graph also degrades more
gracefully than the canonical graph.



4.4 Bounds on Fragility 

Since the upper bound on reliability and how gracefully reliability degrades de- 
pend on the fragility of a Harary graph, we examine some fragility properties of 
canonical Harary graphs. We show that canonical Harary graphs can be signifi- 
cantly more fragile than some other families of Harary graphs, and describe how 
to construct these less fragile graphs. 

We define a t-cutset of a graph G to be a set of t nodes of G whose removal
disconnects G. And, given a subgraph S of G, we define the joint neighbors of
S to be those nodes in G - S that are a neighbor of a node in S. The two ideas
are related: if S is connected, A are the joint neighbors of S, and $A \cup S \subset G$,
then A is a |A|-cutset of G.

$H_{n,n-1}$ is an n-clique and so $F(H_{n,n-1}, f) = 0$ for all f between 0 and n.
Hence, in the following we assume that t < n - 1.

Theorem 3. For even t > 2 and n > t + 2, the number of subsets of t nodes
that disconnect $H^c_{n,t}$ is n(n - t - 1)/2.

We can construct members of $H_{n,t}$ for even t and n > 2t that are less fragile
than $H^c_{n,t}$. These graphs, which we call modified Harary graphs and denote with
$H^m_{n,t}$, have n distinct t-cutsets. The right hand graph of Figure 3 is $H^m_{22,4}$ while
the left hand graph is $H^c_{22,4}$. The algorithm for constructing $H^m_{n,t}$ is as follows:

First construct $H^c_{n,2}$. Then, for each value of m: $2 \le m \le t/2$ add edges
(i, j) where $|i - j| = (m + 1) \bmod n$.

Theorem 4. $H^m_{n,t}$ is in $H_{n,t}$. When n > 2t, $H^m_{n,t}$ has n t-cutsets.
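For small graphs, the number of t-cutsets can be counted by brute force. The Python sketch below does so for even t, building both families as circulant graphs: offsets 1 to t/2 for the canonical graph, and offset 1 together with offsets 3 to t/2 + 1 for the modified graph. The helper names (circulant, is_connected, count_cutsets, canonical_even, modified_even) are hypothetical, and the circulant representation is an assumption of the sketch.

```python
from itertools import combinations

def circulant(n, offsets):
    """Adjacency sets of the graph on 0..n-1 joining i and i + m mod n."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for m in offsets:
            adj[i].add((i + m) % n)
            adj[(i + m) % n].add(i)
    return adj

def is_connected(adj, removed):
    """True if the graph stays connected after deleting the removed nodes."""
    live = [v for v in adj if v not in removed]
    if len(live) <= 1:
        return True
    seen, stack = {live[0]}, [live[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(live)

def count_cutsets(adj, size):
    """Number of node subsets of the given size whose removal disconnects adj."""
    return sum(1 for s in combinations(adj, size)
               if not is_connected(adj, set(s)))

def canonical_even(n, t):
    return circulant(n, range(1, t // 2 + 1))

def modified_even(n, t):
    return circulant(n, [1] + [m + 1 for m in range(2, t // 2 + 1)])
```

For n = 22 and t = 4, count_cutsets reports 187 cutsets of size 4 for the canonical graph, matching n(n - t - 1)/2, and 22 for the modified graph.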

It is not hard to find examples of Harary graphs that have fewer than n
t-cutsets [20]. For arbitrary n and t, however, it remains an open problem
what the minimum number of cutsets of size t is for graphs in $H_{n,t}$.

Canonical Harary graphs have a better fragility when t is odd:

Theorem 5. For $H^c_{n,t}$ with even n > 6 and t = 3, there are n cutsets of size t.

Theorem 6. For $H^c_{n,t}$ with odd n > 7 and t = 3, there are n cutsets of size t.

Theorem 7. For $H^c_{n,t}$ with even n > t + 2 and odd t > 5, there are n cutsets of
size t. For $H^c_{n,t}$ with odd n > t + 2 and odd t > 5, there are n cutsets of size t.

When n and t are both odd, only n - 1 nodes have t neighbors; one node
has t + 1 neighbors. Hence, it may be surprising that for odd n and t, $H^c_{n,t}$ has n
t-cutsets. It is not hard to construct graphs in $H_{n,t}$ for odd n and t that have
n - 1 t-cutsets; for example, there is a graph in $H_{9,3}$ that has 8 3-cutsets. The
improvement in fragility for such graphs, however, is small.

For a fixed t > 2, the fragility $F(H^c_{n,t})$ decreases with increasing n. This is
because the number of cutsets of size t grows either linearly or quadratically
with n, while $\binom{n}{t}$ is of $O(n^t)$. Similarly, $F(H^m_{n,t})$ decreases with increasing n.
The same observations hold for $F(H^c_{n,t})$ and $F(H^m_{n,t})$ when n is fixed and t is
increased.

5 Comparing Harary Graph-Based Flooding and Gossip 

A simple comparison of flooding over a Harary graph and gossip based on the
following four criteria indicates that each has its own strengths: 

1. Scalability For a small wide-area network and local area network, the two 
protocols are equivalent. Both protocols run simple algorithms that are based 
on slowly-changing information. Both require a processor to know the identity 
of the other processors on its (small wide-area or local area) network and a small 
amount of constant information (the rule used to identify neighbors versus the 
constants B and F). And, in both protocols the number of messages that each 
processor sends is independent of n. 

2. Adaptability The flooding protocol requires the processors to use the same
Harary graph. Since each processor independently determines its neighbors, it
might appear that gossip is more adaptable than flooding over a Harary graph.
From a practical point of view, though, we expect that they are similar in adapt-
ability. Given the controlled environment of local-area networks and small wide-
area networks, it is not hard to bound from below with equivalent reliabilities the
time it takes for each protocol to terminate. Then, one can use reliable broad-
cast based on the old set of processors to disseminate the new set of processors.
In addition, adding or removing a single processor causes only t processors to
change their neighbors in Harary graph flooding. If t is small, then a simple (and
non-scalable) agreement protocol can be used to change the set of processors.

3. Graceful Degradation Gossip degrades more gracefully than Harary graph
flooding. Figure 4 illustrates this for n = 22. The Harary graphs used here are
$H^c_{22,4}$ and $H^m_{22,4}$, and the gossip protocols use B = 3, F = 2 and B = 2, F = 2,
both with B_initial = B. We compare the degradation of reliability of these
protocols because they have similar message overheads. Note that while
Harary graph flooding yields a higher reliability for small f even when f > t,
both gossip protocols have a higher reliability for f > 10.

4. Message Overhead Since a graph in $H_{n,t}$ has a minimum number of links while remaining
t-connected, it is not surprising that Harary graph flooding sends fewer messages
than gossip. For example, given n = 32, flooding on a graph in $H_{32,4}$ and gossip with B = 4
and F = 3 provide similar reliabilities given processors that are crashed for five
minutes a day. Gossip sends roughly BF/(t - 1) = 4 times as many messages as
Harary graph flooding.
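As a rough check of this ratio, one can divide the bound on gossip messages from Section 3, BF(n - f), by the failure-free bound on flooding messages from Section 4.2, n(t - 1) + 1. Taking f = 0 is an assumption made here only to keep the arithmetic simple:

```latex
\[
\frac{BF(n-f)}{n(t-1)+1}
  \;\approx\; \frac{BF}{t-1}
  \;=\; \frac{4 \cdot 3}{4-1} \;=\; 4,
\qquad\text{and for } n = 32,\ f = 0:\quad
\frac{4 \cdot 3 \cdot 32}{32 \cdot 3 + 1} \;=\; \frac{384}{97} \;\approx\; 3.96 .
\]
```

Both quantities are upper bounds, so the ratio indicates the order of the difference rather than a measured value.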

To make a more detailed comparison of message overhead, we evaluated the
performance of gossip and Harary graph flooding using the ns simulator [27].
We simulated Ethernet-based networks. One of the networks we considered was
a LAN of a single Ethernet, where there are 32 processors. The other was a
small WAN of three Ethernets pairwise connected, where each Ethernet has 21
processors, one of them also acting like a router.

Fig. 4. Graceful degradation of gossip and flooding on $H_{22,4}$. Curves: Canonical
H(22,4); Modified H(22,4); Gossip B=3, F=2; Gossip B=2, F=2.

For the single LAN we imposed a Harary graph from $H_{32,4}$ on the processors, and for the
small WAN a 63-node Harary graph was imposed where processors 0, 21 and 42 were the routers
connecting the three LANs. Thus on each LAN, the router has two neighbors
on a different LAN and another processor has one neighbor outside of the LAN.
We compared flooding on these Harary graphs with gossip where B = 4, F = 3,
and B_initial = B.

We obtained the properties of an Ethernet based on those of a common
Ethernet for LANs. For the single LAN, we assumed a bandwidth of 10 Mbps,
and for the small WAN, we assumed that the links between routers of the LANs
have a bandwidth of 1 Mbps and a delay of 10 ms. The ns simulator followed
the Ethernet specifications and provided the low-level details. Each message was
contained in a 1K-byte packet. We did not consider failures in these simulations.
All the results were computed as the average of 100 broadcasts.

We found that gossip initially delivers broadcasts faster. However, it takes
longer for the last few processors to receive the message. Therefore, it takes
longer for gossip to complete a broadcast than for Harary graph flooding. For
both protocols, there was a fair number of collisions because both protocols
make intensive use of the network. These collisions increased the completion
time for both protocols. We also found that the packet flow of Harary graph
flooding diminishes much more quickly than that of gossip.

In the small WAN, Harary graph flooding imposes a much smaller load on
the routers than gossip does. We measured the number of packets that were
sent across the LANs. With gossip, an average of 169 packets were sent between
each pair of LANs. With Harary graph flooding, only 6 packets went across
because processors on each LAN have a total of three neighbors on another LAN.
This kind of overloading of routers with redundant messages happens when the
network topology is not taken into account [19, 23].

A more detailed discussion of these simulation results can be found in [20]. 

6 Conclusions 

Gossip protocols are often advertised as being attractive because their simplicity
and their use of randomization make them scalable and adaptable. Further-
more, their reliability degrades gracefully with respect to the actual number of
failures f. We believe that, while these advertised attractions are valid, Harary
graph flooding also provides most of these attractions with a substantially lower
message overhead. Furthermore, Harary graph flooding appears to be faster than
gossip for broadcast-bus networks. Hence, for local-area networks and small wide-
area networks, the only benefit of gossip is that its reliability degrades more
gracefully than that of Harary graph flooding.

This remaining advantage of gossip, however, should be considered carefully.
While it is true that gossip degrades more gracefully than Harary graph flooding,
the reliability of flooding over a Harary graph decreases only slightly for values of f slightly
larger than t. Hence, Harary graph flooding provides some latitude in computing
t. And, both suffer from reduced reliability as f increases further. One improves
graceful degradation of gossip by increasing F (or, to a slightly lesser degree, B).
Doing so increases the message overhead and network congestion, and therefore
the protocol completion time.
Abstract. We consider the problem of generic broadcast in asynchronous
systems with crashes, a problem that was first studied in [12]. Roughly
speaking, given a "conflict" relation on the set of messages, generic broad-
cast ensures that any two messages that conflict are delivered in the same
order; messages that do not conflict may be delivered in different order.
In this paper, we define what it means for an implementation of generic
broadcast to be "thrifty", and give corresponding implementations that
are optimal in terms of resiliency. We also give an interesting application
of our results regarding the implementation of atomic broadcast.



1 Introduction 

Atomic broadcast is a well-known building block of fault-tolerant distributed
applications (e.g., see [7,4,9,8,10,3,2]). Informally, this communication primi-
tive ensures that all messages broadcast are delivered in the same order. In a
recent paper, Pedone and Schiper noted that for some applications some mes-
sages do not "conflict" with each other, and hence they can be delivered by
different processes in different orders [12]. For such applications, the broadcast
communication primitive does not need to order all messages; it must order only
the conflicting ones. An example given in [12] consists of read and write mes-
sages broadcast to replicated servers, where read messages do not conflict with
each other, and hence do not have to be ordered. Intuitively, one may want to
avoid ordering the delivery of messages unless it is really necessary: such ordering
may be expensive, or even impossible unless one uses oracles such as failure
detectors, and these can be unreliable.

In view of the above, Pedone and Schiper proposed a generalized version of 
atomic broadcast, called generie broadeast. Informally, given any eonfliet rela- 
tion deflned over the set of messages, if two messages m and conflict, then 
generic broadcast ensures that they are delivered in the same order.^ Messages 
that do not conflict are not required to be ordered. Note that if the conflict rela- 
tion includes all the pairs of messages, generic broadcast coincides with atomic 

^ Research partially supported by NSF grants CCR-9711403. 

^ The conflict relation is a parameter of generic broadcast. We assume that it is sym- 
metric and non-reflexive. 

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 268-282, 2000. 

(c) Springer-Verlag Berlin Heidelberg 2000

broadcast. On the other hand, if the conflict relation is empty, generic broadcast
reduces to reliable broadcast.

How can one implement a generic broadcast primitive? A trivial way is to
use atomic broadcast to broadcast every message that we want to gbroadcast.^
This ensures that all messages are ordered, including non-conflicting ones. Such
an implementation is unsatisfactory, and goes against the motivation for introducing
generic broadcast in the first place. To avoid this trivial implementation,
and in order to characterize “good” implementations, Pedone and Schiper introduced
the notion of strictness. Roughly speaking, an implementation of generic
broadcast is strict if it has at least one execution in which two processes deliver
two non-conflicting messages in a different order. The notion of strictness is intended
to capture the intuitive idea that the total order delivery of messages
has a cost, and this cost should be paid only when necessary. As Pedone and
Schiper point out in [13], however, the strictness requirement is not sufficient
to characterize good implementations of generic broadcast. Intuitively, this is
because there is a strict implementation that first orders all the messages, including
non-conflicting ones, and then selects two non-conflicting messages and
delivers them in different orders. Even though such an implementation is strict,
it goes against the motivation behind generic broadcast.

In this paper, we reconsider the question of what it means for an implementation
of generic broadcast to be good, and we propose new definitions. We first
note that in asynchronous systems with crash failures (the systems considered
in [12] and here), generic broadcast cannot be implemented without the help
of an “oracle” that can be used to order the delivery of messages that conflict.
This oracle could be a “box” that solves atomic broadcast or consensus; or it
could be a failure detector that can be used to implement such a box. In the
first case, this oracle is expensive; in the second case, it can be unreliable and
its mistakes can slow down the delivery of messages.^ In either case, one should
avoid the use of the oracle whenever possible. Thus, a good implementation of
generic broadcast is one that takes advantage of the fact that only conflicting
messages need to be ordered, and uses its oracle only when there are conflicting
messages that are actually broadcast.

This leads us to the following definition. Roughly speaking, an implementation
of generic broadcast is non-trivial w.r.t. an oracle if it satisfies the following
property: if all the messages that are actually broadcast do not conflict with each
other, then the oracle is never used. A non-trivial implementation, however, is
still unsatisfactory: even in a run where there is only one broadcast that conflicts
with a previous one, such an implementation is allowed to use its oracle an



^ Henceforth, gbroadcast and gdeliver are the two primitives associated with generic 
broadcast. Similarly, abroadcast and adeliver are associated with atomic broadcast. 

^ Even though one can implement failure detectors that are fairly accurate in practice
[14, 6], they may have “bad” periods of time when they make too many mistakes
to be useful. For example, from [5] there is an atomic broadcast algorithm that never
delivers messages out of order, but message delivery is delayed if/when the algorithm
happens to rely on the failure detector during one of its bad periods.

unlimited number of times. This motivates our second definition. An implementation
of generic broadcast is thrifty w.r.t. an oracle if it is non-trivial and it
also satisfies the following property: if there is a time after which the messages
broadcast do not conflict with each other, then there is a time after which the
oracle is not used. It is easy to see that non-trivial implementations and thrifty
ones are necessarily strict in the sense of [12].

In this paper, we consider implementations of generic broadcast that use
atomic broadcast as the oracle. Atomic broadcast is a natural oracle for the task
of totally ordering conflicting messages. Furthermore, any implementation that is
thrifty w.r.t. atomic broadcast can be transformed into an implementation that
is thrifty w.r.t. consensus. It can also be transformed into an implementation
that is thrifty w.r.t. ◇S, the weakest failure detector that can be used to solve
generic broadcast (this last transformation assumes that a majority of processes
is correct).

We present two implementations of generic broadcast: one is non-trivial and
the other is thrifty. The non-trivial implementation is simple and illustrates some
of our basic techniques; the thrifty implementation is more complex and builds
upon the simple implementation. Both implementations work for asynchronous
systems with n processes where up to f < n/2 may crash, which is optimal.
Since both implementations are also strict, this improves on the resiliency of the
strict implementation given in [12], which tolerates up to f < n/3 crashes.

We continue the paper with an interesting use of thrifty implementations of
generic broadcast. Specifically, we show how they can be used to derive “sparing”
implementations of atomic broadcast, as we now explain. First note that in
asynchronous systems with failures, any implementation of atomic broadcast
requires the use of an external oracle, and (just as with generic broadcast) it
is better to avoid relying on this oracle whenever possible. For example, if the
oracle is a failure detector, relying on this oracle during one of its “bad” periods
can delay the delivery of messages. So we would like an implementation of atomic
broadcast that uses the oracle sparingly. How can we do so?

Suppose a process atomically broadcasts m and then m'. No oracle is needed
to ensure that m and m' are delivered in the same order everywhere: FIFO order
can be easily enforced with sequence numbers assigned by the sender. Similarly,
suppose two atomic broadcast messages happen to be causally related^, e.g.,
m is adelivered by a process before it abroadcasts m'. Then, we can order the
delivery of m and m' without any oracle (this can be done with message piggybacking
or “vector clocks”; see for example [10]). Thus, an implementation
of atomic broadcast can reduce its reliance on the oracle, by restricting its use
to the ordering of broadcast messages that are concurrent. We say that an implementation
of atomic broadcast is sparing w.r.t. an oracle if it satisfies the
following property: If there is a time after which the messages broadcast are
pairwise causally related, then there is a time after which the oracle is not used.



^ We say that two messages are causally related or concurrent, if their broadcast events
are causally related or concurrent, respectively, in the sense of [11, 10].

We conclude the paper by showing how to transform any implementation of
atomic broadcast that uses some oracle, into one that is sparing w.r.t. the same
oracle. To do so, we use a thrifty implementation of generic broadcast and vector
clocks.

As a final remark, note that Pedone and Schiper use message latency as a way
to evaluate the efficiency of generic broadcast implementations. In “good” runs
with no failures and no conflicting messages, their generic broadcast algorithm
ensures that every message is delivered within 2δ (assuming δ is the maximum
message delay). In this paper, our focus was not on optimizing the latency of
messages in these good runs, but rather on reducing the dependency on the oracle
whenever possible. These two goals, however, are not incompatible. In fact, we
can modify our thrifty implementations of generic broadcast to also achieve a
small message latency in good runs. Specifically, we have an implementation
that assumes f < n/3 and ensures message delivery within 2δ in such runs (as
in [12]). We also have an implementation that works for f < n/2 and ensures
message delivery within 3δ in good runs. It is worth noting that even in runs
with failures and conflicting messages, the message delivery times of 2δ and 3δ,
respectively, are eventually achieved provided there is a time after which the
messages broadcast are not conflicting.

In summary, this paper considers the problem of generic broadcast in asynchronous
systems with crashes, a problem that was first studied in [12]. We first
propose alternative definitions of “good” implementations of generic broadcast
(the previous definition in terms of “strictness” had some drawbacks). Roughly
speaking, we consider an implementation to be good if it does not rely on any
oracle when the messages that are broadcast do not conflict. We then give two
such implementations (with atomic broadcast as their oracle): one does not use
the oracle in runs where no messages conflict, and the other one stops using
the oracle if conflicting broadcasts cease. Both implementations are optimal in
terms of resiliency; they tolerate up to f < n/2 process crashes (an improvement
over [12]). We then use our results to give “sparing” implementations of atomic
broadcast, i.e., implementations that stop using their oracle if concurrent broadcasts
cease. Finally, we show how to transform any implementation of atomic
broadcast into a sparing one.

In this extended abstract, we omit the proofs (they are given in the full paper
[1]).

2 Informal Model 

We consider asynchronous distributed systems. To simplify the presentation of
our model, we assume the existence of a discrete global clock. This is merely a
fictional device: the processes do not have access to it. We take the range T of
the clock's ticks to be the set of natural numbers N.

The system consists of a set of n processes, Π = {1, 2, ..., n}, and an oracle.
Processes are connected with each other through reliable asynchronous
communication channels. Up to f processes can fail by crashing. A failure pattern
indicates which processes crash, and when, during an execution. Formally,
a failure pattern F is a function from T to 2^Π, where F(t) denotes the set of
processes that have crashed through time t. Once a process crashes, it does not
“recover”, i.e., ∀t : F(t) ⊆ F(t + 1). We define crashed(F) = ∪_{t ∈ T} F(t) and
correct(F) = Π − crashed(F). If p ∈ crashed(F) we say p crashes (in F), and if
p ∈ correct(F) we say p is correct (in F).
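
As an informal illustration of these definitions, the following Python sketch represents a failure pattern over a finite horizon. The finite horizon and the names `make_failure_pattern`, `crashed` and `correct` are assumptions made for this example only; the model itself ranges over all natural numbers.

```python
def make_failure_pattern(crash_times):
    """Build F from a dict {process: crash time}.

    F(t) is the set of processes that have crashed through time t.
    Monotonicity (F(t) is a subset of F(t + 1)) holds by construction,
    which mirrors the assumption that crashed processes do not recover.
    """
    def F(t):
        return {p for p, tc in crash_times.items() if tc <= t}
    return F


def crashed(F, horizon):
    """Union of F(t) for t = 0..horizon (a finite stand-in for all of T)."""
    result = set()
    for t in range(horizon + 1):
        result |= F(t)
    return result


def correct(F, processes, horizon):
    """Processes that never crash within the horizon."""
    return set(processes) - crashed(F, horizon)


# Hypothetical system: five processes, process 2 crashes at time 3
# and process 5 crashes at time 7.
processes = {1, 2, 3, 4, 5}
F = make_failure_pattern({2: 3, 5: 7})
```

Because F is monotone, `crashed(F, horizon)` equals `F(horizon)`; the explicit union is kept only to follow the definition literally.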

A distributed algorithm A is a collection of n deterministic automata (one
for each process in the system). The execution of A occurs in steps as follows.
For every time t ∈ T, at most one process takes a step; moreover, every correct
process takes an infinite number of steps. In each step, a process (1) may send a
message to a process; (2) queries the oracle (the query may be ⊥); (3) receives
an answer from the oracle (possibly ⊥); (4) receives a message (possibly ⊥); and
(5) changes state. We say that a process uses the oracle at time t if it performs
a non-⊥ query at time t.

An oracle history H is a sequence of quadruples (p, t, i, o), where p is a process,
t is a time (t is monotonically increasing in H), i is the query of p at time
t, and o is the answer of the oracle to p at time t. We assume that if no process
ever uses the oracle (all queries in H are ⊥) then the oracle never gives
any answer (all answers in H are ⊥). An oracle O is a function that takes a failure
pattern F and returns a set O(F) of oracle histories^. Oracles of interest
include failure detectors [5], an atomic broadcast black-box, and a consensus
black-box. For example, an atomic broadcast black-box can be modeled as an
oracle that accepts “broadcast(m)” queries, and outputs “deliver(m)” answers,
where the queries/answers satisfy the usual specification of atomic broadcast
(see Section 2.2).



2.1 Reliable broadcast 

Intuitively, reliable broadcast ensures that processes agree on the set of messages
that they deliver. More precisely, reliable broadcast is defined in terms of two
primitives: rbroadcast(m) and rdeliver(m). We say that process p broadcasts
message m if p invokes rbroadcast(m). We assume that every broadcast message
m includes the following fields: the identity of its sender, denoted sender(m), and
a sequence number, denoted seq(m). These fields make every message unique.
We say that q delivers message m if q returns from the invocation of rdeliver(m).
Primitives rbroadcast and rdeliver satisfy the following properties:^

Validity: If a correct process broadcasts a message m, then it eventually delivers
m.

Uniform Agreement: If a process delivers a message m, then all correct processes
eventually deliver m.

Uniform Integrity: For every message m, every process delivers m at most once,
and only if m was previously broadcast by sender(m).

^ We assume this set allows any process to make any query at any time. 

^ All the broadcast primitives that we define in this paper are uniform [10]. To abbreviate
the notation, we drop the word “uniform” from the various broadcast types.

Validity and Uniform Agreement imply that if a correct process broadcasts
a message m, then all correct processes eventually deliver m.

2.2 Atomic broadcast 

Intuitively, atomic broadcast ensures that processes agree on the order in which they
deliver messages. More precisely, atomic broadcast is defined in terms of primitives
abroadcast(m) and adeliver(m) that must satisfy the Validity, Uniform Agreement
and Uniform Integrity properties above, and the following property:

Uniform Total Order: If some process delivers message m before message m',
then a process delivers m' only after it has delivered m.^

2.3 Generic broadcast 

Generic broadcast is parametrized by a conflict relation (denoted ∼) defined over
the set of messages; this relation is assumed to be symmetric and non-reflexive.
Informally, generic broadcast ensures that if two messages m and m' conflict,
then they are delivered in the same order. Messages that do not conflict are not
required to be ordered. More precisely, generic broadcast is defined in terms of
the conflict relation (given as a parameter) and two primitives: gbroadcast(m)
and gdeliver(m) that must satisfy the Validity, Uniform Agreement and Uniform
Integrity properties above, and the following property:

Uniform Generalized Order: If messages m and m' conflict and some process
delivers m before m', then a process delivers m' only after it has delivered
m.

If the conflict relation includes all the pairs of messages, generic broadcast coincides
with atomic broadcast; if the conflict relation is empty, generic broadcast
reduces to reliable broadcast.
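
To make the ordering property concrete, the following Python sketch checks Uniform Generalized Order on a finite record of deliveries. It is only an illustration under stated assumptions: the function name `violates_generalized_order`, the dictionary representation of deliveries, and the read/write conflict relation in the usage lines are all hypothetical, and "delivers m before m'" is read here as "delivers both, m first".

```python
from itertools import combinations


def violates_generalized_order(deliveries, conflict):
    """Return True if the recorded deliveries violate Uniform Generalized Order.

    deliveries: dict mapping each process to the list of messages it
                delivered, in delivery order.
    conflict:   symmetric, non-reflexive predicate on pairs of messages.
    """
    # Position of every message in every process's delivery sequence.
    pos = {p: {m: i for i, m in enumerate(seq)} for p, seq in deliveries.items()}
    messages = {m for seq in deliveries.values() for m in seq}

    for a, b in combinations(sorted(messages), 2):
        if not conflict(a, b):
            continue  # non-conflicting messages need not be ordered
        for m, m2 in ((a, b), (b, a)):
            # Does some process deliver m before m2?
            if any(m in d and m2 in d and d[m] < d[m2] for d in pos.values()):
                # Then every process that delivers m2 must have delivered m first.
                for d in pos.values():
                    if m2 in d and (m not in d or d[m] > d[m2]):
                        return True
    return False


def rw_conflict(a, b):
    """Hypothetical conflict relation: writes conflict with everything else."""
    return a != b and (a.startswith("w") or b.startswith("w"))


# Reads may be delivered in different orders; the write is ordered after both.
ok_run = {1: ["r1", "r2", "w1"], 2: ["r2", "r1", "w1"]}
# Two conflicting writes delivered in opposite orders.
bad_run = {1: ["w1", "w2"], 2: ["w2", "w1"]}
```

With a conflict relation that relates every pair of distinct messages, the same check enforces Uniform Total Order, which matches the remark that generic broadcast then coincides with atomic broadcast.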

3 Thrifty implementations 

Let A be an implementation of generic broadcast that can use an oracle X, and
let Runs(A) be the set of runs of A. Let gbcast-msgs(r) be the set of messages
gbroadcast in r and gbcast-msgs(r, [t, ∞)) be the set of messages gbroadcast in
r at or after time t.

Definition 1. We say that A is non-trivial w.r.t. oracle X if, when no conflicting
messages are gbroadcast, X is not used. More precisely: ∀r ∈ Runs(A),
[∀m, m' ∈ gbcast-msgs(r), m and m' do not conflict] ⇒ X is not used in r.

Definition 2. We say that A is thrifty w.r.t. X if it is non-trivial w.r.t. X
and it guarantees the following property: if there is a time after which messages
gbroadcast do not conflict with each other, then eventually X is no longer used.
More precisely: ∀r ∈ Runs(A), [∃t, ∀m, m' ∈ gbcast-msgs(r, [t, ∞)), m and m' do
not conflict] ⇒ ∃t', X is not used in r after time t'.
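
The two definitions can be explored on finite traces with a small Python sketch. This is a diagnostic aid under assumptions, not a decision procedure: thriftiness is a property of infinite runs, so a finite prefix can only show where conflicts stop and whether the oracle is still used afterwards. The trace format and the names `nontrivial_on_trace`, `earliest_conflict_free_time` and `rw_conflict` are hypothetical.

```python
from itertools import combinations


def nontrivial_on_trace(gbroadcasts, oracle_use_times, conflict):
    """Check the non-triviality condition on a finite trace.

    gbroadcasts:      list of (time, message) pairs.
    oracle_use_times: list of times at which the oracle was used.
    Returns False only when no two gbroadcast messages conflict and the
    oracle was nevertheless used.
    """
    msgs = [m for _, m in gbroadcasts]
    any_conflict = any(conflict(a, b) for a, b in combinations(msgs, 2))
    return any_conflict or not oracle_use_times


def earliest_conflict_free_time(gbroadcasts, conflict):
    """Earliest broadcast time t such that the messages gbroadcast at or
    after t are pairwise non-conflicting, or None if there is no such time."""
    for t in sorted({time for time, _ in gbroadcasts}):
        tail = [m for time, m in gbroadcasts if time >= t]
        if not any(conflict(a, b) for a, b in combinations(tail, 2)):
            return t
    return None


def rw_conflict(a, b):
    """Hypothetical conflict relation: writes conflict with everything else."""
    return a != b and (a.startswith("w") or b.startswith("w"))


trace = [(1, "w1"), (2, "w2"), (5, "r1"), (7, "r2")]
quiet_from = earliest_conflict_free_time(trace, rw_conflict)
```

On this trace, a thrifty implementation is free to use the oracle for the two writes, but oracle uses that keep occurring long after `quiet_from` would be a symptom of an implementation that is not thrifty.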

^ In [10], Uniform Total Order is a weaker property.

4 A non-trivial implementation of generic broadcast

We now give a non-trivial implementation of generic broadcast for asynchronous
systems with a majority of correct processes. The implementation, given in Figure 1,
uses atomic broadcast as an oracle, and reliable broadcast as a subroutine
(reliable broadcast can be easily implemented in asynchronous systems with process
crashes without the use of oracles). In this implementation, C(m) denotes
the set {m} ∪ {m' : m' conflicts with m}.

To gbroadcast a message m, the basic idea is that processes go through two
rounds of messages, and then the broadcaster p decides to either rbroadcast m
(in which case the oracle is not used) or abroadcast m (in which case the oracle
is used). More precisely, to gbroadcast a message m, p sends (m, FIRST) to all
processes, where FIRST is a tag to distinguish different types of messages. When
a process receives (m, FIRST), it adds m to its set seen of messages, and checks if
m conflicts with any messages in seen. If it does, it sends (m, BAD, SECOND) to all
processes; else, it sends (m, GOOD, SECOND). When a process receives a message
of the form (m, *, SECOND) from n − f processes, it adds m to its seen set, and then
checks if a majority of SECOND messages are GOOD, and if its seen set has no
messages conflicting with m. If so, the process adds m to its set possibleRB. It
then sends (m, possibleRB ∩ C(m), THIRD) to p (the process that gbroadcast
m), where possibleRB ∩ C(m) is the subset of messages in possibleRB that
either conflict with m or are equal to m (note that possibleRB ∩ C(m) can be
empty, it can contain m, and it can contain messages distinct from m). When p
receives messages of the form (m, *, THIRD) from n − f processes, it checks if a
majority of them have m in their poss sets. If so, p rbroadcasts m; else, p abroadcasts
m, together with the union of all poss sets received. When a process rdelivers
m, it gdelivers m if it has not done so previously. When a process adelivers
(m, prec), it gdelivers all messages in prec (if it has not done so already), and
then gdelivers m.

In this implementation, each process keeps two local variables: seen and possibleRB.
The first one keeps the set of gbroadcast messages that the process
has seen so far, and the second keeps the set of gbroadcast messages that are
possibly reliably broadcast.
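
The two decision points of this implementation (a process handling its quorum of SECOND messages, and the broadcaster handling its quorum of THIRD messages) can be sketched in Python as follows. This is a minimal sketch that omits message passing, quorum collection and the oracle itself; the function names, argument names and the `rw_conflict` relation are assumptions introduced for illustration.

```python
def second_round_step(m, n, good_senders, seen, possible_rb, conflicts):
    """Handle SECOND messages for m from n - f processes.

    good_senders: set of processes that sent (m, GOOD, SECOND).
    seen, possible_rb: the local sets of this process (updated in place).
    Returns the set possibleRB ∩ C(m) to be sent in the THIRD message.
    """
    seen.add(m)
    no_conflict = not any(conflicts(m, other) for other in seen if other != m)
    if len(good_senders) > n / 2 and no_conflict:
        possible_rb.add(m)
    # C(m) is m itself together with every message that conflicts with m.
    return {x for x in possible_rb if x == m or conflicts(m, x)}


def sender_decision(m, n, poss):
    """Handle THIRD messages for m from n - f processes at the broadcaster.

    poss: dict mapping each replying process to the set it reported.
    Returns ("rbroadcast", m) if a majority of all n processes reported m,
    and ("abroadcast", m, prec) otherwise, where prec is the union of the
    reported sets.
    """
    votes = sum(1 for reported in poss.values() if m in reported)
    if votes > n / 2:
        return ("rbroadcast", m)  # oracle not used
    prec = set().union(*poss.values())
    return ("abroadcast", m, prec)  # oracle used


def rw_conflict(a, b):
    """Hypothetical conflict relation: writes conflict with everything else."""
    return a != b and (a.startswith("w") or b.startswith("w"))
```

For example, with n = 3 and two THIRD replies that both contain r1, `sender_decision("r1", 3, {1: {"r1"}, 2: {"r1"}})` selects the oracle-free path; if one replier had already seen a conflicting write, the majority test fails and the message is abroadcast instead.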

Theorem 1. Consider an asynchronous system with a majority of correct processes
(n > 2f). The algorithm in Figure 1 is a non-trivial implementation of
generic broadcast that uses atomic broadcast as an oracle.

Observation: In asynchronous systems with n ≤ 2f, there are no non-trivial
implementations of generic broadcast w.r.t. any oracle X.



1  For every process p:
2    initialization:
3      seen ← ∅; possibleRB ← ∅
4    to gbroadcast(m): send (m, FIRST) to all processes
5    upon receive(m, FIRST) from q:
6      seen ← seen ∪ {m}
7      if seen has no messages conflicting with m
8        then send (m, GOOD, SECOND) to all processes
9        else send (m, BAD, SECOND) to all processes
10   upon receive(m, *, SECOND) from n − f processes for the first time:
11     seen ← seen ∪ {m}
12     good ← {r : received (m, GOOD, SECOND) from r}
13     if |good| > n/2 and seen has no messages conflicting with m
14       then possibleRB ← possibleRB ∪ {m}
15     send (m, possibleRB ∩ C(m), THIRD) to sender(m)
16   upon receive(m, *, THIRD) from n − f processes for the first time:
17     R ← {r : received (m, *, THIRD) from r}
18     for each r ∈ R do poss[r] ← M s.t. received (m, M, THIRD) from r
19     if |{r : m ∈ poss[r]}| > n/2 then rbroadcast(m)
20     else abroadcast(m, ∪_{r ∈ R} poss[r])
21   upon rdeliver(m): if m not gdelivered then gdeliver(m)
22   upon adeliver(m, prec):
23     for each m' ∈ prec do
24       if m' not gdelivered then gdeliver(m')
25     if m not gdelivered then gdeliver(m)

Fig. 1. Non-trivial implementation of generic broadcast with an atomic broadcast oracle



5 A thrifty implementation of generic broadcast

We now give a thrifty implementation of generic broadcast that uses atomic
broadcast as an oracle. It works in asynchronous systems in which a majority
of processes is correct. This implementation is given in Figure 2, and builds
upon the non-trivial implementation given in Section 4. In this implementation,
C(M) = M ∪ {m' : m' conflicts with some m ∈ M}, and C(m) = {m} ∪ {m' :
m' conflicts with m}.

Each process p keeps four variables: seen, possibleRB, stable and adel. seen
is the set of gbroadcast messages that p has seen but has not yet adelivered or
rdelivered. possibleRB is the set of gbroadcast messages that can be rbroadcast,
but were not yet adelivered or rdelivered. adel is the set of messages that p has
adelivered. stable is a set of pairs (m, B), where m is a message and B is a set of
messages. Intuitively, (m, B) ∈ stable means that p has adelivered or rdelivered
m, and p must gdeliver all messages in B before it gdelivers m. We denote by
π1 the projection on the first component of a tuple or of a set of tuples. That
is, π1((m, B)) is m and π1(stable) is the set of m such that (m, B) ∈ stable for
some B.

To gbroadcast a message m, a process p sends (m, FIRST) to all processes.
Upon receipt of such a message, a process q adds m to its seen set, if m is not
in π1(stable). Then q sends to all processes a SECOND message containing m,
together with seen, and those elements of stable whose first component either
conflicts with some message in seen ∪ {m} or belongs to seen ∪ {m}. When a
process q receives (m, s, st, SECOND), it adds to seen those elements in s that are
not in π1(stable), and it adds st to stable. When q collects SECOND messages from
n − f processes, it checks if seen contains m and no messages conflicting with m,
and if so, q adds m to possibleRB. Then, q sends to p (the gbroadcaster of m)
a THIRD message containing m, seen, possibleRB and those elements in stable
whose first component either conflicts with some message in seen ∪ {m} or belongs
to seen ∪ {m}. When p receives a THIRD message for m from n − f processes, it
checks if a majority of them have m in their third components and if m is not
in π1(stable). If so, p rbroadcasts m, together with those messages in π1(stable)
that conflict with m. Else, p abroadcasts m, together with (1) the so-called flush
set, which contains those messages that are in the seen sets of a majority of
processes, (2) the so-called prec set, which contains those messages that are in
the possibleRB set of some process and that either conflict with a message in
flush ∪ {m} or belong to flush ∪ {m}, and (3) those messages in π1(stable) that either
conflict with a message in flush ∪ prec ∪ {m} or belong to flush ∪ prec ∪ {m}. We
assume that, before p abroadcasts (m, flush, prec, ...), p chooses some arbitrary
ordering for the messages in flush and prec, which will be known to any process
that adelivers (m, flush, prec, ...).

When a process q rdelivers (m, before), it removes m from possibleRB and
from seen, and adds (m, before) to stable. Then, q looks for elements (m', B) in
stable such that m' has not been gdelivered and all messages in B have been
gdelivered. If q finds such an element, q gdelivers m'.

When q adelivers (m, flush, prec, before), it removes {m} ∪ prec ∪ flush from
possibleRB and from seen, and adds adel to before. Then, q iterates over the
ordered elements of prec. For each element m' of prec, q adds to stable the tuple
(m', B), where B is the set of elements in before that conflict with m'. The intuition
here is that (m', B) ∈ stable means that q must gdeliver m' after q gdelivers all
elements in B. Then, in a similar fashion, q iterates over the ordered elements
of flush, to add each of them to stable. Next, q adds m to stable and adds
{m} ∪ flush ∪ prec to adel. Finally, q looks for elements (m', B) in stable such
that m' has not been gdelivered and all messages in B have been gdelivered. If
q finds such an element, q gdelivers m'.
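
The final step shared by the rdeliver and adeliver handlers (repeatedly gdelivering any stable message whose predecessors have all been gdelivered) can be sketched in Python as below. The name `drain_stable` and the list-of-pairs representation of stable are assumptions for this example; the sketch covers only the delivery loop, not how stable is populated.

```python
def drain_stable(stable, gdelivered):
    """Deliver every message in stable whose predecessor set is satisfied.

    stable:     iterable of (message, predecessors) pairs, where
                predecessors is a set of messages that must be gdelivered
                before the message itself.
    gdelivered: set of messages already gdelivered (updated in place).
    Returns the messages gdelivered by this call, in delivery order.
    """
    delivered_now = []
    progress = True
    while progress:
        progress = False
        for msg, before in stable:
            # Deliverable once all required predecessors have been gdelivered.
            if msg not in gdelivered and before <= gdelivered:
                gdelivered.add(msg)
                delivered_now.append(msg)
                progress = True
    return delivered_now


# Hypothetical state: c waits for b, b waits for a, and d waits for a
# message z that has not become deliverable yet.
stable = [
    ("c", frozenset({"b"})),
    ("b", frozenset({"a"})),
    ("a", frozenset()),
    ("d", frozenset({"z"})),
]
already = set()
order = drain_stable(stable, already)
```

The loop repeats until a full pass makes no progress, so a delivery that unblocks an earlier entry in the list is picked up on the next pass; message d stays pending until z is gdelivered.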

Theorem 2. Consider an asynchronous system with a majority of correct processes
(n > 2f). The algorithm in Figure 2 is a thrifty implementation of generic
broadcast that uses atomic broadcast as an oracle.



1  For every process p:
2    initialization:
3      seen ← ∅; possibleRB ← ∅; stable ← ∅; adel ← ∅
4    to gbroadcast(m): send (m, FIRST) to all processes
5    upon receive(m, FIRST) from q:
6      if m ∉ π1(stable) then seen ← seen ∪ {m}
7      send (m, seen, {X ∈ stable : π1(X) ∈ C(seen ∪ {m})}, SECOND) to all processes
8    upon receive(m, s, st, SECOND) from q:
9      seen ← seen ∪ (s \ π1(stable)); stable ← stable ∪ st
10     if received messages of the form (m, *, *, SECOND) from n − f processes for the first time then
11       if m ∉ π1(stable) then seen ← seen ∪ {m}
12       if seen ∩ C(m) = {m} then possibleRB ← possibleRB ∪ {m}
13       send (m, seen, possibleRB, {X ∈ stable : π1(X) ∈ C(seen ∪ {m})}, THIRD) to sender(m)
14   upon receive(m, *, *, *, THIRD) from n − f processes for the first time:
15     R ← {r : received (m, *, *, *, THIRD) from r}
16     for each r ∈ R do
17       s[r] ← M, where M is the set such that p received (m, M, *, *, THIRD) from r
18       poss[r] ← M, where M is the set such that p received (m, *, M, *, THIRD) from r
19       stable ← stable ∪ M, where M is the set such that p received (m, *, *, M, THIRD) from r
20     if |{r : m ∈ poss[r]}| > n/2 and m ∉ π1(stable) then rbroadcast(m, π1(stable) ∩ C(m))
21     else if m ∉ π1(stable) then
22       flush ← {m' : m' ≠ m ∧ |{q : m' ∈ s[q]}| > n/2}
23       prec ← (∪_{r ∈ R} poss[r]) ∩ C(flush ∪ {m})
24       abroadcast(m, flush, prec, π1(stable) ∩ C(flush ∪ prec ∪ {m}))
       /* in the abroadcast message above, sets flush and prec are ordered, arbitrarily */
25   upon rdeliver(m, before):
26     possibleRB ← possibleRB \ {m}; seen ← seen \ {m}
27     stable ← stable ∪ {(m, before)}
28     while ∃(m', B) ∈ stable s.t. m' not gdelivered and all messages in B have been gdelivered
29       do gdeliver(m')
30   upon adeliver(m, flush, prec, before):
31     possibleRB ← possibleRB \ ({m} ∪ prec ∪ flush); seen ← seen \ ({m} ∪ prec ∪ flush)
32     before ← before ∪ adel
33     for each m' ∈ prec do stable ← stable ∪ {(m', C(m') ∩ before)}; before ← before ∪ {m'}
34     for each m' ∈ flush do stable ← stable ∪ {(m', C(m') ∩ before)}; before ← before ∪ {m'}
       /* the for each loops above iterate in the order of the ordered sets prec and flush */
35     stable ← stable ∪ {(m, C(m) ∩ before)}; adel ← adel ∪ {m} ∪ flush ∪ prec
36     while ∃(m', B) ∈ stable s.t. m' not gdelivered and all messages in B have been gdelivered
37       do gdeliver(m')

Fig. 2. Thrifty implementation of generic broadcast with an atomic broadcast oracle



6 A sparing implementation of atomic broadcast

As we explained in the introduction, we would like to solve atomic broadcast
with an algorithm that does not rely on an oracle whenever possible. Since no
oracle is needed to order the delivery of causally related messages, we would
like the atomic broadcast algorithm to stop using the oracle when messages are
causally related.

More precisely, we say that message m immediately causally precedes message
m', and denote m → m', if either (1) some process p abroadcasts m and then
abroadcasts m', or (2) some process p adelivers m and then abroadcasts m'.
Thus, → is a relation on the set of messages. Let →* be the transitive closure
of →. We say that m is causally related to m' if either m →* m' or m' →* m. If
m and m' are not causally related, we say that m and m' are concurrent. These
definitions are based on [11].

Let A be an implementation of atomic broadcast that uses an oracle X.

Definition 3. We say that A is sparing w.r.t. oracle X if it guarantees
the following property: if there is a time after which messages abroadcast are
pairwise causally related, then eventually X is no longer used. More precisely:
∀r ∈ Runs(A), [∃t, ∀m, m' ∈ abcast-msgs(r, [t, ∞)), m and m' are causally related] ⇒
∃t', X is not used in r after time t', where abcast-msgs(r, [t, ∞)) denotes the set
of messages abroadcast in r at or after time t.

In this section, we show how to transform any implementation of atomic
broadcast that uses some oracle X, into an implementation that is sparing w.r.t.
X. As a first step, we show how to transform any implementation of generic
broadcast that is thrifty w.r.t. oracle X, into an implementation of atomic broadcast
that is sparing w.r.t. X. This is achieved through the algorithm in Figure 3.

In this algorithm, seq denotes the number of messages that p has abroadcast
so far, while ndel[q] is the number of messages from q that p has adelivered so far.
Intuitively, ts is a vector timestamp for messages such that if ts is the timestamp
of m, then ts[j] is the number of messages from process j that causally precede
m. We can show that if m causally precedes m' and their timestamps are ts and
ts', respectively, then ts ≤ ts'.

To abroadcast a message m, process p first obtains a new vector timestamp ts
for m, by copying the vector ndel to ts, and then changing ts[p] to a new sequence
number. Then p gbroadcasts m with its timestamp ts. Upon gdeliver of (m, ts),
a process q copies ts to prec, and changes prec[sender(m)] to ts[sender(m)] − 1.
Intuitively, prec represents the number of messages from each process that q
must adeliver before q can adeliver m. Then q appends (m, prec) to L, and then
searches for the first message (m', prec') in L with prec' ≤ ndel.^ If it finds such
a message, it adelivers m', increments ndel[sender(m')] by one, and removes
(m', prec') from L.

Theorem 3. Consider an asynchronous system with at least one correct process.
If we plug in an implementation of generic broadcast that is thrifty w.r.t. oracle
X into the algorithm in Figure 3, then we obtain an implementation of atomic
broadcast that is sparing w.r.t. oracle X.

As we now explain, we can use this result to transform any implementation
of atomic broadcast that uses an oracle X into an implementation
that is sparing w.r.t. X. To do so, we first replace the atomic broadcast oracle
in Figure 2 with the given implementation, and thus obtain an implementation
of generic broadcast that is thrifty w.r.t. X. We then use the transformation in
Figure 3 to transform this generic broadcast implementation into an implementation
of atomic broadcast that is sparing w.r.t. X (by Theorem 3).

Theorem 4. Given any implementation of atomic broadcast that uses some or- 
acle X, we can transform it to one that is sparing w.r.t. X. 

^ We say that a vector v1 ≤ v2 if for every q ∈ Π, v1[q] ≤ v2[q].

1  For every process p:
2    initialization:
3      seq ← 0          /* number of messages abroadcast by p */
4      L ← ∅            /* ordered set with messages to adeliver */
5      for each q ∈ Π do ndel[q] ← 0
                        /* ndel[q] = number of messages from q that p has adelivered */
6      define (m, ts) ∼ (m', ts') iff ts ≰ ts' and ts' ≰ ts
                        /* conflict relation for generic broadcast */
7    to abroadcast(m):
8      seq ← seq + 1; ts ← ndel; ts[p] ← seq    /* get new timestamp */
9      gbroadcast(m, ts)                         /* with ∼ as the conflict relation */
10   upon gdeliver(m, ts):
11     prec ← ts; prec[sender(m)] ← ts[sender(m)] − 1
12     L ← L · (m, prec)                         /* append (m, prec) to L */
13     while ∃(m', prec') ∈ L such that prec' ≤ ndel do
14       (m', prec') ← first element in L such that prec' ≤ ndel
15       adeliver(m')
16       ndel[sender(m')] ← ndel[sender(m')] + 1
17       L ← L \ (m', prec')

Fig. 3. Transforming thrifty generic broadcast into sparing atomic broadcast
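
As a rough companion to Figure 3, the transformation layer can be sketched in Python as follows. The sketch assumes that a generic broadcast service is supplied as a callback and that processes are numbered from 0; the class name `SparingLayer`, the helpers `leq` and `conflict`, and the explicit sender field in each gbroadcast item are assumptions of this example rather than part of the algorithm's specification.

```python
def leq(v1, v2):
    """Componentwise vector comparison: v1[q] <= v2[q] for every q."""
    return all(a <= b for a, b in zip(v1, v2))


def conflict(ts1, ts2):
    """Two timestamped messages conflict iff their timestamps are incomparable."""
    return not leq(ts1, ts2) and not leq(ts2, ts1)


class SparingLayer:
    """Atomic broadcast built on a (thrifty) generic broadcast service."""

    def __init__(self, pid, n, gbroadcast, adeliver):
        self.pid = pid
        self.seq = 0            # number of messages abroadcast by this process
        self.ndel = [0] * n     # ndel[q]: messages from q adelivered so far
        self.pending = []       # the list L of (message, sender, prec)
        self.gbroadcast = gbroadcast  # callback into the generic broadcast service
        self.adeliver = adeliver      # callback invoked on each adelivery

    def abroadcast(self, m):
        self.seq += 1
        ts = list(self.ndel)    # copy ndel, then stamp own entry
        ts[self.pid] = self.seq
        self.gbroadcast((m, self.pid, tuple(ts)))

    def on_gdeliver(self, item):
        m, sender, ts = item
        prec = list(ts)
        prec[sender] = ts[sender] - 1   # messages that must be adelivered first
        self.pending.append((m, sender, prec))
        while True:
            ready = next((e for e in self.pending if leq(e[2], self.ndel)), None)
            if ready is None:
                break
            self.adeliver(ready[0])
            self.ndel[ready[1]] += 1
            self.pending.remove(ready)


# Usage: process 0 gdelivers the second message of process 1 before the first.
delivered, sent = [], []
layer = SparingLayer(0, 2, sent.append, delivered.append)
layer.on_gdeliver(("m2", 1, (0, 2)))   # held back: m1 is still missing
layer.on_gdeliver(("m1", 1, (0, 1)))   # releases m1 and then m2
```

In this usage the two messages have comparable timestamps, so they do not conflict and the underlying generic broadcast may gdeliver them in either order; the layer restores the causal order locally, without any help from the oracle.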



7 Low-latency thrifty implementations of generic 
broadcast 

It is easy to see that the generic broadcast implementations in Figures 1 and 2
guarantee that in “good” runs with no failures and no conflicting messages, every
message is delivered within 4δ, where δ is the maximum network message
delay.^ It turns out that we can decrease this latency to 3δ with some simple
modifications to the algorithms. Moreover, if we assume that n > 3f (i.e., more
than two-thirds of the processes are correct) then we can further reduce the latency
to 2δ. With the thrifty implementation, this latency is eventually achieved
even in runs with failures and conflicting messages, provided that there is a time
after which the messages gbroadcast are not conflicting.

Reducing the message latency to 3δ. To achieve a latency of 3δ in good
runs, we modify the implementation in Figure 1 as follows: (1) processes should
send the third message to all processes in line 15, (2) instead of rbroadcasting a
message m in line 19, a process p sends a message telling all processes to “deliver
m”, and then p gdelivers m, and (3) upon the receipt of a “deliver m” message
for the first time, a process relays this “deliver m” message to all processes and
gdelivers m. With this modification, it is easy to see that in good runs, every
gbroadcast message is gdelivered within 3δ.

Theorem 5. With the modifications above, the algorithm in Figure 1 ensures
that in runs with no failures and no conflicting messages, every gbroadcast
message is gdelivered within 3δ, where δ is the maximum network message delay.

^ This assumes a reasonable implementation of reliable broadcast, which is used as a 
subroutine in these implementations. 
We can modify the thrifty implementation in Figure 2 in a similar manner:
(1) processes send the third message to all processes in line 13, (2) instead of
rbroadcasting a message in line 20, a process sends a “deliver (m, π₁(stable) ∩
C(m))” message to all processes, sets variable before to π₁(stable) ∩ C(m), and
then executes the code in lines 26-29, and (3) upon the receipt of message
“deliver (m, before)” for the first time, a process relays this “deliver” message to
all processes, and executes the code in lines 26-29.

Theorem 6. With the modifications above, the algorithm in Figure 2 ensures
that if there is a time after which the messages gbroadcast are not conflicting,
eventually every gbroadcast message is gdelivered within 3δ, where δ is the
maximum network message delay.

Reducing the message latency to 2δ when n > 3f. To achieve a latency
of 2δ in good runs, we assume that n > 3f (instead of n > 2f). With this
assumption, Figure 4 gives a non-trivial implementation of generic broadcast.
The implementation is a simplification of the one in Figure 1, and uses atomic
broadcast as the oracle.



1  For every process p:

2    initialization:

3      seen ← ∅; good ← ∅

4    to gbroadcast(m): send (m, first) to all processes

5    upon receive(m, first) from q:

6      seen ← seen ∪ {m}

7      if seen ∩ C(m) = {m} then good ← good ∪ {m}

8      send (m, good ∩ C(m), second) to all processes

9    upon receive(m, *, second) from n − f processes for the first time:

10     R ← {r : received (m, *, second) from r}

11     for each r ∈ R do g[r] ← M s.t. received (m, M, second) from r

12     if |{r : g[r] = {m}}| > 2n/3 then send (m, deliver) to all processes; gdeliver(m)

13     else if p = sender(m) then

14       poss ← {m' ≠ m : |{r : m' ∈ g[r]}| > n/3}

15       abroadcast(m, poss)

16   upon receive(m, deliver) from some process:

17     if m not gdelivered then send (m, deliver) to all processes; gdeliver(m)

18   upon adeliver(m, prec):

19     for each m' ∈ prec do

20       if m' not gdelivered then gdeliver(m')

21     if m not gdelivered then gdeliver(m)



Fig. 4. Low-latency non-trivial implementation of generic broadcast
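
The decision taken once n − f second-round messages have arrived (lines 12 to 14 of Figure 4) can be sketched as follows. This is only an illustration under assumptions: the fast-path test is read as "more than 2n/3 responders reported exactly {m}", the function name `decide` and the dictionary `g` are hypothetical, and in the figure only the sender of m performs the abroadcast.

```python
def decide(m, g, n):
    """Return ("deliver", None) on the fast path, else ("abroadcast", poss).

    g maps each responder r to the set it reported for m, that is,
    its value of good ∩ C(m). n is the total number of processes.
    """
    fast = sum(1 for reported in g.values() if reported == {m})
    if 3 * fast > 2 * n:          # |{r : g[r] = {m}}| > 2n/3
        return ("deliver", None)

    # Count how many responders consider each conflicting message good.
    counts = {}
    for reported in g.values():
        for other in reported:
            if other != m:
                counts[other] = counts.get(other, 0) + 1

    # poss: messages that more than n/3 responders reported as good.
    poss = {other for other, c in counts.items() if 3 * c > n}
    return ("abroadcast", poss)
```

Integer comparisons such as `3 * fast > 2 * n` are used instead of dividing, so the thresholds 2n/3 and n/3 are tested exactly.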



Theorem 7. Consider an asynchronous system with n > 3f. The algorithm in
Figure 4 is a non-trivial implementation of generic broadcast that uses atomic
broadcast as an oracle. In runs with no failures and no conflicting messages,
every gbroadcast message is gdelivered within 2δ, where δ is the maximum network
message delay.
Figure 5 gives a thrifty implementation of generic broadcast with a latency
of 2δ in good runs. The implementation is a simplification of the one in Figure 2,
and uses atomic broadcast as the oracle.

Note that, in line 18, process p sends a message to itself. We did this to 
avoid repetition of code; p should not really send a message to itself, but rather 
execute the code in lines 24-28. 



1  For every process p:

2    initialization:

3      seen ← ∅; good ← ∅; stable ← ∅; adel ← ∅

4    to gbroadcast(m): send (m, first) to all processes

5    upon receive(m, first) from q:

6      if m ∉ π₁(stable) then seen ← seen ∪ {m}

7      if m ∉ π₁(stable) and seen ∩ C(m) = {m} then good ← good ∪ {m}

8      send (m, seen, good, {X ∈ stable : π₁(X) ∈ C(seen ∪ {m})}, second) to all processes

9    upon receive(m, ss, sg, st, second) from q:

10     seen ← seen ∪ (ss \ π₁(stable)); stable ← stable ∪ st

11     if received messages of the form (m, *, *, *, second) from n − f processes for the first time then

12       R ← {r : received (m, *, *, *, second) from r}

13       for each r ∈ R do

14         s[r] ← M, where M is the set such that p received (m, M, *, *, second) from r

15         g[r] ← M, where M is the set such that p received (m, *, M, *, second) from r

16       if m ∉ π₁(stable) then seen ← seen ∪ {m}

17       if |{r : m ∈ g[r]}| > 2n/3 and m ∉ π₁(stable)

18       then send (m, π₁(stable) ∩ C(m), deliver) to p

19       else if m ∉ π₁(stable) and p = sender(m) then

20         flush ← {m' : m' ≠ m ∧ |{q : m' ∈ s[q]}| > 2n/3}

21         prec ← {m' : m' ≠ m ∧ |{q : m' ∈ g[q]}| > n/3}

22         abroadcast(m, flush, prec, π₁(stable) ∩ C(flush ∪ prec ∪ {m}))

           /* in the abroadcast message above, sets flush and prec are ordered arbitrarily */

23   upon receive(m, before, deliver) from some process for the first time:

24     send (m, before, deliver) to all processes

25     good ← good \ {m}; seen ← seen \ {m}

26     stable ← stable ∪ {(m, before)}

27     while ∃(m', B) ∈ stable s.t. m' not gdelivered and all messages in B have been gdelivered

28       do gdeliver(m')

29   upon adeliver(m, flush, prec, before):

30     good ← good \ ({m} ∪ prec ∪ flush); seen ← seen \ ({m} ∪ prec ∪ flush)

31     before ← before ∪ adel

32     for each m' ∈ prec do stable ← stable ∪ {(m', C(m') ∩ before)}; before ← before ∪ {m'}

33     for each m' ∈ flush do stable ← stable ∪ {(m', C(m') ∩ before)}; before ← before ∪ {m'}

       /* the for each loops above iterate in the order of the ordered sets prec and flush */

34     stable ← stable ∪ {(m, C(m) ∩ before)}; adel ← adel ∪ {m} ∪ flush ∪ prec

35     while ∃(m', B) ∈ stable s.t. m' not gdelivered and all messages in B have been gdelivered

36       do gdeliver(m')



Fig. 5. Low-latency thrifty implementation of generic broadcast with an atomic broadcast oracle



Theorem 8. Consider an asynchronous system with n > 3f. The algorithm
in Figure 5 is a thrifty implementation of generic broadcast that uses atomic
broadcast as an oracle. If there is a time after which the messages gbroadcast
are not conflicting, eventually every gbroadcast message is gdelivered within 2δ,
where δ is the maximum network message delay.
Abstract. We consider the problem of searching for a piece of information
in a fully interconnected computer network (or clique) by exploiting
advice about its location from the network nodes. Each node contains a 
database that “knows” what kind of documents or information are stored 
in other nodes (e.g. a node could be a Web server that answers queries 
about documents stored on the Web). The databases in each node, when 
queried, provide a pointer that leads to the node that contains the infor- 
mation. However, this information is up-to-date (or correct) with some 
bounded probability. While, in principle, one may always locate the infor- 
mation by simply visiting the network nodes in some prescribed ordering, 
this requires a time complexity in the order of the number of nodes of the 
network. In this paper, we provide algorithms for locating an informa- 
tion node in the complete communication network, that take advantage 
of advice given from network nodes. The nodes may either give correct 
advice, by pointing directly to the information node, or give wrong advice 
by pointing elsewhere. We show that, on the average, if the probability 
p that a node provides correct advice is asymptotically larger than 1/n, 
where n is the number of the computer nodes, then the average time com- 
plexity for locating the information node is, asymptotically, 1/p or 2/p 
depending on the available memory. The probability p may, in general, 
be a function of the number of network nodes n. On the lower bounds
side, we prove that no fixed memory deterministic algorithm can locate
the information node in a finite expected number of steps. We also prove
a lower bound of 1/p for the expected number of steps of any algorithm
that locates the information node in the complete network.

Key Words and Phrases: Search, Information Retrieval, Complete 
Network, Uncertainty, Random Walks. 

1 Introduction 

Suppose that we have a network of computers interconnected in some particular 
topology (e.g. ring, mesh, clique etc.) and one of the computers possesses a piece 
of information. The objective is to design a software agent that is able to travel 
along the communication links of the network and locate the information as fast 
as possible. An additional element to this picture, that differentiates the search 
from the usual random search, is that some of the computers, when queried, will 
respond with the name of the computer that holds the information. However, 
some other faulty computers, when queried, will give consistently wrong advice 
as to which computer has the information. (We do not consider intermittent 
faults that may appear and then disappear in an unpredictable fashion.) In this 
paper we address the problem of locating a piece of information in the complete 
network using, possibly incorrect, advice from the nodes, where each computer 
may communicate directly with any other computer. 

This variant of searching with uncertainty was introduced in [12], where the
network topologies were the ring and the torus. Models with faulty information 
in the nodes have been considered before for the problem of routing (see [1, 
3, 6, 8, 14]). However, in this problem it is assumed that the identity of the 
node that contains the information is known, and what is required is to reach 
this node following the best possible route. Search problems in graphs, where the 
identity of the node that contains the information sought is not known, have been 
considered in the past. We have the deterministic search games, where a fugitive 
that possesses some properties (it is agile or inert) hides in the nodes or edges of a 
graph and the aim is to locate it using as few searchers as possible (for definitions 
and relevant theory see [5, 11, 16]). Also, the problem of exploring an unknown 
graph has been considered (see [2, 13, 17, 18], for example). Closer to the spirit 
of our work are the search problems that are defined and studied in [4] where 
the authors consider problems of locating points in the plane using incomplete 
knowledge about their position. On another front, a number of algorithms have 
appeared that search a graph relying on some sort of random walk along its 
edges (for a comprehensive and very readable survey, see [15]). We have a very 
interesting class of games called stochastic games in which opponents’ strategies 
incorporate some sort of probability transition matrix (see [21]). Also, in [22] one 
may find a detailed treatment of Geometric Games where the aim is to locate 
elements of a hidden set. 
In this paper, we show how to locate a piece of information on the complete
network of n nodes using some advice they give when queried. The advice is
correct, i.e. it directs to the information node, with some bounded probability p(n).
The class of algorithms we consider can execute one of the following operations: 
(i) query a node of the complete network about the location of the information 
node (ii) follow the advice given by a queried node and (iii) select the next node 
to visit using some probability distribution function on the network nodes (that 
may be, for example, a function of the number of previously seen nodes or the 
number of steps up to now) . 

The algorithms we consider may be either memoryless, have limited memory 
or have unlimited memory. An algorithm is memoryless, if it may not store any 
information before it moves from a node to the next one. It is of limited memory, 
if it can only store a fixed amount of information (i.e. independent from the size 
of the complete network). Finally, an algorithm is of unlimited memory, if it may 
store any amount of information, usually the identities of the nodes encountered 
in the past. In addition, an algorithm may use randomization or be deterministic. 

Table 1 summarizes our results for various types of algorithms. We observe
that no deterministic algorithm with no memory at all or with a fixed amount of
memory can locate the information node in a finite expected number of steps.
The reason for this is that it may fall into cycles in the space of its states, for
sufficiently large cliques. However, there exists a deterministic algorithm with
unlimited memory (O(n log n) bits always suffice) that achieves an expected
number of steps asymptotically equal to 1/p. Moreover, randomized algorithms
can achieve expectation 2/p even with a limited amount of memory. We give an
algorithm that simply remembers if in the previous step it followed the advice of
the queried node or not, thus requiring only one bit of memory. Moreover, when
unlimited memory is available, the expectation falls to 1/p.

Table 1. Expected number of steps for various classes of algorithms

In what follows, we first describe the network model we adopted and how 
the faulty and non-faulty nodes are determined. We then prove a lower bound
of 1/p (we assume that p is not 1, since this trivial case is easily handled) for the
expected number of steps required by any algorithm that locates the information 
node in the complete network. We also give an argument for the impossibility 
of achieving finite expectation in the number of steps with memoryless deter- 
ministic algorithms. We then describe some algorithms for searching for the 
information node and analyze their complexity. 
2 The computer network model

In order to present and analyse the various algorithmic approaches to the problem
of searching for a piece of information in the complete network, we will define
a model of the computer network that captures the fact that some computers
give valid advice (i.e. they point to the node containing the information) while
some others give faulty advice by pointing elsewhere.

The computer network is modelled as a clique on n nodes. There are direct 
links between any pair of nodes. Each node contains the following information: 

— Node identity number. This is a number within the range 1 . . . n that uniquely
identifies each node.

— Advice. This is also a number within the range 1 . . . n and it is interpreted 
as the advice of the node as to which node of the network contains the 
information. 

In order to model the fact that the advice of a node may be incorrect with 
some probability as well as the fact that the information may reside in any node 
in a random fashion, we introduce randomness to the model in the following 
way: 

— A node is randomly and uniformly selected from among the n nodes that
contains the information and its identity, say s, becomes known to the other
nodes. Node s sets its advice to be equal to itself, i.e. to contain s. In this
way, any algorithm may distinguish a node that contains the information
from one that does not.

— Each node, except s, flips a coin that shows heads with probability p (a
varying parameter of the model) and if heads shows up, then the node sets
its advice equal to s. These nodes give correct advice about the location
of the information. If tails appears, then the node randomly and uniformly
selects a number from within the set {1 ... n} − {s, id}, where id is the node’s
identity number. These nodes are the faulty nodes.

After this procedure is carried out, no changes are permitted to any piece of 
information that was determined by the procedure. That is, the sets of faulty 
and non-faulty nodes remain the same and the wrong advice given by the faulty 
nodes never changes. 
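
The randomized set-up described above can be sketched in a few lines of Python. This is an illustration only: the function name `build_network`, the dictionary representation of the advice, and the explicit random generator argument are assumptions of the sketch, and it requires n ≥ 3 so that a faulty node always has a wrong target to choose.

```python
import random


def build_network(n, p, rng):
    """Return (s, advice) for a clique on nodes 1..n.

    s is the information node, chosen uniformly at random.
    advice[v] is the node that v points to; advice[s] == s.
    """
    s = rng.randint(1, n)
    advice = {s: s}
    for v in range(1, n + 1):
        if v == s:
            continue
        if rng.random() < p:
            advice[v] = s                      # heads: correct advice
        else:
            # tails: uniform choice from {1..n} - {s, v}
            wrong = [u for u in range(1, n + 1) if u not in (s, v)]
            advice[v] = rng.choice(wrong)
    return s, advice
```

The returned dictionary is never modified afterwards, which mirrors the requirement that faulty nodes keep giving the same wrong advice.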

3 Some lower bounds for searching in the clique 

Let K_n = (V_n, E_n) be the clique graph on n vertices. In this section, we will prove
that if p = ω(1/n), then no algorithm that searches for a piece of information
in K_n (under the model we have defined) can do, asymptotically, better than
executing 1/p steps on the average. To this end, we will use the fact that given a
random set of nodes S ⊂ V_n of cardinality k, an adaptive random walk that starts
from a node in V_n − S will hit some node in S in expected time that converges
to Θ(n/k), as n tends to infinity. An adaptive random walk is simply a random
walk in which after each step, the currently visited node, along with its edges,
is deleted from the clique.

Our aim is to show that the problem of locating the information node is 
at least as difficult as the problem of locating one of k non faulty nodes using 
a random walk that starts from a randomly chosen initial node. The following 
proposition holds: 

Proposition 1. For a randomly selected set S of non-faulty nodes, locating the
information node t ∈ S is at least as hard, on the average, as hitting the random
set S.

It is easy to see why this proposition is valid by observing that any algorithm 
that locates the information node in some number of steps may also be used to 
hit the random set to which the information node belongs in the same number 
of steps. 

Now given a node v in V_n, we denote by r_v(u) the probability of the adaptive
random walk choosing vertex u ≠ v when it leaves node v. In the uniform
adaptive random walk on K_n, r_v(u) takes the same value for every remaining
neighbor u of v. The following can be proved:

Lemma 2. The uniform adaptive random walk is an optimal algorithm for hitting
a randomly selected set of k clique nodes.

Proof. Suppose that an algorithm uses a non-uniform distribution function
r_v(u) (that may also be a function of the current step) and that it is about to
select the next vertex to visit. Then the probability of failure to select at step i
a node from the random set of size k is

\[
P_v = \sum_{u \text{ adjacent to } v} r_v(u)\left(1 - \frac{k}{n-i}\right)
    = \left(1 - \frac{k}{n-i}\right) \sum_{u \text{ adjacent to } v} r_v(u)
    = 1 - \frac{k}{n-i},
\]

since nothing has been revealed about the initial random set of k nodes (it is
still random after the deletion of i clique nodes).

Therefore, we see that the probability of failure from any node is independent 
of the probability distribution chosen for the node by the algorithm. Therefore, 
the uniform distribution is an optimal one. 

Now no algorithm can have at the i-th step a probability of failure less than
the probability of failure of the adaptive random walk. This is because for such
an algorithm the set of the k selected nodes is still random and the clique is
essentially the clique K_{n−i}. Therefore, the probability of not hitting this set,
using any distribution on the outgoing edges of the currently visited node, is

\[
1 - \frac{k}{n-i}. \qquad \blacksquare
\]
According to the lemma above, the number of steps taken by the adaptive
random walk is a lower bound for the number of steps required by any algorithm
to locate one of the randomly chosen nodes. The expected number of steps
required by the uniform adaptive random walk is the following:

\[
\sum_{i=1}^{n-k+1} i \prod_{l=1}^{i-1}\left(1 - \frac{k}{n-l+1}\right)\frac{k}{n-i+1}
\]
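
The expected number of steps of the uniform adaptive random walk can be evaluated exactly for small cases. The sketch below assumes the sum has the form "step index i, times the probability of missing the set in steps 1 to i − 1, times the probability k/(n − i + 1) of hitting it at step i"; the function name is hypothetical. For drawing without replacement, this expectation is known to equal (n + 1)/(k + 1), which the usage note checks on small inputs.

```python
from fractions import Fraction


def expected_hitting_steps(n, k):
    """Exact expected number of draws until a set of k out of n nodes is hit."""
    total = Fraction(0)
    survive = Fraction(1)                  # probability of missing so far
    for i in range(1, n - k + 2):          # i = 1, ..., n - k + 1
        success = Fraction(k, n - i + 1)   # k targets among n - i + 1 nodes
        total += i * survive * success
        survive *= 1 - success
    return total
```

For example, `expected_hitting_steps(10, 3)` evaluates to `Fraction(11, 4)`, which is close to n/k for k much smaller than n.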



As n tends to infinity, the above expression is bounded from below, asymptotically,
by the expectation of the geometric distribution with probability of success
k/n. The following lemma will help us establish this fact:

Lemma 3. The following inequality holds:

\[
\sum_{i=1}^{n-k+1} i \prod_{l=1}^{i-1}\left(1 - \frac{k}{n-l+1}\right)\frac{k}{n-i+1} \;\ge\; \dots
\]

Proof. We write

\[
\sum_{i=1}^{n-k+1} i \prod_{l=1}^{i-1}\left(1 - \frac{k}{n-l+1}\right)\frac{k}{n-i+1} = \dots
\]

Then for the product it holds that

\[
\prod_{l=1}^{i-1}\left(1 - \frac{k}{n-l+1}\right)
\;\ge\; \exp\left(\int_{1}^{i} \ln\left(1 - \frac{k}{n-l+1}\right) dl\right)
= \left(1 - \frac{k}{n}\right)^{n} \frac{(n-i+1)^{n-i+1}}{(n-k)^{k}\,(n-i+1-k)^{n-i+1-k}}.
\]

Therefore, using again approximations of sums with integrals, we obtain the
following:

\[
\sum_{i=1}^{n-k+1} i \prod_{l=1}^{i-1}\left(1 - \frac{k}{n-l+1}\right)\frac{k}{n-i+1} \;\ge\; \dots
\]
From the corollary, we see that the uniform adaptive random walk cannot do
better, asymptotically, than achieving an expected number of steps n/k, for k =
o(n). If k/n = c < 1, then the lower bound becomes, asymptotically, 1/c − 1, but it
is of no better “order” than 1/c.

In order to use the corollary in the context of searching with uncertainty in
the complete network, we observe that if the probability of a node being
non-faulty is p, then, on the average, there are pn non-faulty nodes in the network.
The remaining nodes are faulty and they give incorrect advice. Therefore, any
algorithm that searches for the information node can, at the very best, locate
one of these randomly chosen pn nodes and follow its advice. This can be done
in expected number of steps n/k = n/(pn) = 1/p.

We will now give an argument for showing that no bounded memory deterministic
algorithm can achieve a finite expected number of steps. In other words,
randomization is necessary if memory is bounded. Let A be a deterministic
algorithm with a fixed amount of memory, that locates the information node in
cliques of any size. A deterministic algorithm that performs one of the operations
we described can be thought of as a function f : (S, V) → (S, V ∪ {0}), where
S is the finite set of different states A may assume, V = {v_1, ..., v_n} is the set
of nodes and the number 0 means that the algorithm follows the advice of the
currently visited clique node. By states of the algorithm, we mean the different
possible contents of its memory combined with its internal states. The set of its
internal states is necessarily finite.

It is easy to see that there can be no deterministic algorithm that locates
the information node in cliques of any size, that does not follow the advice of
at least one of the nodes it encounters. For let K_n be one of the cliques that
the algorithm handles correctly. Then the set of pairs (s, v), with s a state of
the algorithm and v ∈ V_n a vertex of the clique with n nodes, is finite since
the algorithm has finite memory. If n' is an integer larger than |S × V_n|, then
consider the operation of the algorithm on K_{n'} (in what follows, we consider K_n
as a subgraph of K_{n'}). Since the information node is chosen randomly, there is a
nonzero probability that it will be one of the nodes in V_{n'} − V_n, say v'. Since the
algorithm is deterministic, there is no pair (s, v), with v ∈ V_n, that is mapped
onto (s', v') for some state s'. Then the algorithm fails to locate the information
node, which contradicts our assumption that it works correctly for cliques of
any size. Therefore, there must be some state/node pair that is mapped onto
(s', 0) by the algorithm, for some state s' and, moreover, the algorithm must
encounter at least one of these pairs, say (s, v), at some point of its operation
on K_n (otherwise it would, again, not operate properly with K_{n'}). Now there is
a nonzero probability that the following events occur simultaneously:

— The node v is faulty. 

— The (wrong) advice it gives sends the algorithm to a previously encountered 
node (but, possibly, the algorithm is now in another state). 

The algorithm must again follow the advice of some node u of K_n at some point,
for otherwise there exists again a sufficiently large clique that cannot be handled
correctly. Again, the two events above occur simultaneously for u with nonzero
probability. Since the advice of the faulty nodes never changes, the probability
that the algorithm will be constantly directed to previously seen state/node pairs
(for K_n) is fixed and nonzero. Since the state/node pairs are finite in number, the
algorithm will eventually reach a previously encountered pair (s, v), with some
fixed nonzero probability. If this occurs, then the algorithm will repeat infinitely
the same steps without ever locating the information node in K_{n'}. Therefore, the
expected number of steps required by the algorithm cannot be finite.

The essence of the above argument is that a finite memory deterministic 
algorithm should always follow the advice of the nodes it encounters. However, 
in doing this, it will eventually enter a cycle in its state space with fixed and 
nonzero probability. It can also be proved that if we alter our model so that 
when a faulty node is queried again it determines its wrong advice at random 
and does not necessarily give the same wrong advice, then the “always follow 
the advice” policy results in an algorithm with expected number of steps equal 




4 Fixed memory and randomization: one bit of memory 
helps 

In this section, we will show that just one bit of memory suffices in order to
reduce the expected number of steps to locate the information node from n − 1
(random walk) to 2/p, which is o(n). In contrast, when the search algorithm is not
allowed to have any memory, then an optimal way to search for the information
node is to perform a random walk on the clique nodes. Such a random walk will
hit the information node in expected number of steps n − 1 (see [15]).
The algorithm given below simply alternates between following the advice
of the currently visited node and selecting a random node as the next node to
visit. It only needs to remember if at the last step it followed the advice or not.
Although one bit that is complemented at each step suffices, the algorithm is
stated for convenience as if it knew the number of steps it has taken, checking
at each step if this number is odd or even.

Algorithm: Fixed Memory Search

Input: A clique (V, E) with a node designated as the information holder

Aim: Find the node containing the information

1.  begin

2.    current ← RANDOM(V)

3.    l ← 1

4.    while current(information) ≠ true

5.      l ← l + 1

6.      if l mod 2 = 0

7.        current ← current(advice)

8.      else

9.        current ← RANDOM(V − current)

10.   end while

11. end
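
A runnable sketch of the alternating search may help make the step counting concrete. The Python below is an illustration under assumptions: nodes are the integers 1..n, `advice` is a dictionary from node to advised node, the information node s is passed in only to test for termination, and the name `fixed_memory_search` is hypothetical. The single bit of memory is the boolean `follow`.

```python
import random


def fixed_memory_search(n, s, advice, rng):
    """Alternate between following advice and jumping to a random node.

    Returns the number of steps taken, counting the initial random
    choice as step 1, as in the pseudocode.
    """
    nodes = list(range(1, n + 1))
    current = rng.choice(nodes)
    steps = 1
    follow = True                  # the one bit: follow advice on even steps
    while current != s:
        steps += 1
        if follow:
            current = advice[current]
        else:
            current = rng.choice([u for u in nodes if u != current])
        follow = not follow        # complement the bit at each step
    return steps
```

When every node gives correct advice the search ends in at most two steps; with faulty advice, termination relies on the random jumps, each of which reaches s with probability 1/(n − 1).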

Let us estimate the average number of steps that are required by the algo- 
rithm in order to locate the piece of information. 

At step l, failure can occur in one of the following ways:

— If l = 1, the algorithm fails if it randomly selects one node other than
the node containing the information. The probability for this happening is
q_1 = 1 − 1/n.

— If l is even, the algorithm fails if the currently visited node is faulty. The
probability of this event is q_l = q = 1 − p.

— If l is odd and larger than 1, the algorithm fails if it randomly selects a node
other than the currently visited one, that does not contain the information.
The probability of this event is q_l = 1 − 1/(n − 1).

Then the expected number of steps (expectation of the random variable l) is
the following (where q_j = 1 − p_j):



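\[
\begin{aligned}
E[\#\text{ of steps}]
&= \sum_{l \ge 1} l \cdot \Pr[\text{first success in } l\text{-th step}]
 = \sum_{l \ge 1} l \, q_1 \cdots q_{l-1}\, p_l \\
&= \frac{1}{n}
 + \sum_{l \ge 2,\ \text{even}} l \left(1 - \frac{1}{n}\right) q^{\frac{l-2}{2}}
   \left(1 - \frac{1}{n-1}\right)^{\frac{l-2}{2}} (1 - q) \\
&\quad + \sum_{l \ge 3,\ \text{odd}} l \left(1 - \frac{1}{n}\right) q^{\frac{l-1}{2}}
   \left(1 - \frac{1}{n-1}\right)^{\frac{l-3}{2}} \frac{1}{n-1} \\
&= \frac{1}{n} + \left(1 - \frac{1}{n}\right) \sum_{l \ge 0}
   \left[(2l+2)(1-q) + (2l+3)\frac{q}{n-1}\right] q^{l} \left(1 - \frac{1}{n-1}\right)^{l}.
\end{aligned}
\]

After some tedious algebraic manipulations, it can be shown that

\[
E[\#\text{ of steps}] = 1 + \frac{(n-1)^2 (1+q)}{n\,(n - qn + 2q - 1)}
= 1 + \left(1 - \frac{1}{n}\right)^{2} \frac{2 - p}{p + \frac{1-2p}{n}}.
\]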

If pn → ∞, i.e. p = ω(1/n), then the expression above converges to 2/p. However,
if pn → 0, i.e. p = o(1/n), the expression for the expectation tends to 2n. If
p = Θ(1/n) so that pn → c, then the expectation is asymptotically equal to
2n/(c + 1), which again grows linearly in n.
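
The closed form for the fixed memory search can be checked numerically against the defining series. The sketch below assumes the per-step success probabilities used in the analysis (1/n at step 1, p at even steps, 1/(n − 1) at odd steps larger than 1) and the closed form 1 + (n − 1)²(1 + q)/(n(n − qn + 2q − 1)); both function names are hypothetical.

```python
def closed_form(n, p):
    """Closed-form expected number of steps, with q = 1 - p."""
    q = 1.0 - p
    return 1.0 + (n - 1) ** 2 * (1.0 + q) / (n * (n - q * n + 2.0 * q - 1.0))


def truncated_series(n, p, terms=5000):
    """Sum l * Pr[first success at step l] for l = 1..terms."""
    total = 1.0 / n                 # success at step 1
    alive = 1.0 - 1.0 / n           # probability that steps 1..l-1 all failed
    for l in range(2, terms + 1):
        p_l = p if l % 2 == 0 else 1.0 / (n - 1)
        total += l * alive * p_l
        alive *= 1.0 - p_l
    return total
```

For fixed p and growing n the closed form approaches 2/p, in line with the first of the three asymptotic regimes.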



5 Unlimited memory: randomized and deterministic 
algorithms 

We will now give an unlimited memory randomized algorithm for locating the 
information node in a clique. More specifically, the algorithm can store the pre- 
viously visited nodes using O(nlogn) memory bits and use this knowledge in 
order to decide its next move. In this way, the algorithm avoids visiting again 
previously visited nodes. Such an algorithm always terminates within n — 1 steps 
in the worst case. 

Algorithm: Unlimited Memory Randomized Search

Input: A clique (V, E) with a node designated as the information holder

Aim: Find the node containing the information

1.  begin

2.    l ← 1

3.    current ← RANDOM(V)

4.    M ← {current}      // M holds the up to now visited nodes

5.    while current(information) ≠ true

6.      read(current(advice))

7.      if current ∉ M

8.        M ← M ∪ {current}

9.        current ← advice

10.     end if

11.     else

12.       current ← RANDOM(V − M)

13.     l ← l + 1

14.   end while

15. end
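
The idea of remembering visited nodes can be sketched in Python as follows. This is a simplified variant written for illustration, not a transcription of the pseudocode: it follows the advice of the current node when that advice leads to an unvisited node and otherwise jumps to a random unvisited node. The function name, the dictionary `advice`, and passing s only for the termination test are assumptions of the sketch.

```python
import random


def unlimited_memory_search(n, s, advice, rng):
    """Search that never revisits a node; returns the number of steps."""
    current = rng.randint(1, n)
    visited = {current}
    steps = 1
    while current != s:
        target = advice[current]
        if target not in visited:
            current = target       # follow advice to a new node
        else:
            # advice leads nowhere new: pick a random unvisited node
            unvisited = [u for u in range(1, n + 1) if u not in visited]
            current = rng.choice(unvisited)
        visited.add(current)
        steps += 1
    return steps
```

Every step visits a new node, so the sketch stops after at most n steps. Replacing `rng.choice(unvisited)` with `min(unvisited)` gives a deterministic variant in the spirit of the one discussed in the text.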

We can now prove the following: 

Theorem 5. If p = ω(1/n), then the expected number of steps required by the
algorithm in order to locate the information is, asymptotically, equal to 1/p.

Proof. The algorithm fails at step l > 1 due to one of the following events:

1. Event A_1: the answer v to the query is one of the previously visited values.

2. Event A_2: the answer is a node that lies outside the previously visited set
but the piece of information is not there.

Let E_l be the event that at steps 1, ..., l the algorithm failed to find the
information. Then the probability that the algorithm has failed at steps
1, ..., l is equal to q_1 q_2 ⋯ q_l, where

\[
\begin{aligned}
q_l &= \Pr[\text{failure in } l\text{-th step} \mid E_{l-1}]
     = \Pr[A_1 \mid E_{l-1}] + \Pr[A_2 \mid E_{l-1}] \\
    &= q\,\frac{l}{n-2}\cdot\frac{n-l-2}{n-l-1} + q\,\frac{n-l-2}{n-2} \\
    &= q\,\frac{n-l-2}{n-2}\left(\frac{l}{n-l-1} + 1\right)
     = q\,\frac{n-l-2}{n-l-1}\cdot\frac{n-1}{n-2}.
\end{aligned}
\]

It follows that the expected number of steps of the above algorithm is given by
the formula

\[
\begin{aligned}
E[\#\text{ of steps}]
&= \sum_{l=1}^{n-2} l \cdot \Pr[\text{first success in } l\text{-th step}]
 = \sum_{l=1}^{n-2} l \, q_0 q_1 \cdots q_{l-1}\, p_l \\
&= \sum_{l=1}^{n-2} l \, q_0 q_1 \cdots q_{l-1} - \sum_{l=1}^{n-2} l \, q_0 q_1 \cdots q_{l-1} q_l \\
&= \sum_{l=1}^{n-2} q_0 q_1 \cdots q_{l-1} - (n-2)\, q_0 q_1 \cdots q_{n-2} \\
&= \dots
\end{aligned}
\]

where a = … To obtain the last formula we used the well-known expansion
of the geometric progression. It follows that asymptotically in n,

\[
\dots
\]

Substituting this in the previous formula we obtain that

\[
E[\#\text{ of steps}] = \dots
\]

As 1/p = o(n) (from the hypothesis that p = ω(1/n)), we obtain that, asymptotically,
E[# of steps] = 1/p, which completes the proof of the theorem. ■

As we have proved, no fixed memory deterministic algorithm has a finite 
expectation in the number of steps to locate the information node. However, 
the unlimited memory randomized algorithm we described in this section can be 
converted easily into an unlimited memory deterministic one by only changing 
line 12 to be as follows: 



current = the lexicographically smallest node in V — M . 

This change does not affect the analysis of the randomized algorithm because 
the probability that the chosen node is faulty or not, or that it is the information 
node, is the same as if the next node had been chosen at random. 



6 Conclusions 

In this paper we have considered the problem of searching for an information 
node in the complete network using information about its location obtained from 
the nodes of the network, where the information is correct with some bounded 
probability p. This problem was introduced in [12] where the ring and toroidal 
interconnection networks were considered. 

For the complete network, we proved that there is no algorithm that can 
locate the information node in expected number of steps less than We also 
proved that there is no fixed memory deterministic algorithm that achieves a 
finite expectation in the number of steps. We also gave various search algorithms 
and analyzed their expected number of steps. It is interesting to consider the 
same problem for general graphs. One complication that immediately arises is 
that a node giving correct advice does not point to the information node directly, 
since it may not be adjacent to it, but it points to a node that lies on a shortest 
path to that node (see [12]). It also appears that deriving useful (i.e. as functions 
of the number of nodes) lower bounds for the expected number of steps to locate 
the information node must be difficult. Our proof of the ^ lower bound was 
based on the analysis of the expected number of steps required by a random 
walk on the complete network before hitting a randomly chosen set of nodes
that give correct advice. We believe that doing the same for other classes of
graphs is an interesting line of research. However, tackling more general classes
of graphs using the random walk technique seems to be difficult, in view of
the discussion in [15] about the analysis of the expected hitting times in random
walks on various classes of graphs. In [15] it is stated that hitting (or access) times
can remain bounded, i.e. independent of the number of nodes of the graph, even
for regular graphs. In the same paper, it is also stated that the only graphs for
which a nonconstant lower bound on the expected hitting time can be proved
are the graphs with transitive automorphism group. Therefore, deriving lower
bounds for the problem of searching with uncertainty in more general classes of
graphs may require more complex lower bound techniques.
Abstract. We present a new replication algorithm that supports replication
of a large number of objects on a diverse set of nodes. The algorithm
allows replica sets to be changed dynamically on a per-object basis.
It tolerates most types of failures, including multiple node failures, network
partitions, and sudden node retirements. These advantages make
the algorithm particularly attractive in large cluster-based data services
that experience frequent failures and configuration changes. We prove
the correctness of the algorithm and show that its performance is near-optimal.



1 Introduction 

We present a new lightweight replication algorithm designed for PC-based Internet
data services, such as FTP and email. Our algorithm is unique
in many aspects. First, our algorithm lets copies of an object, or replicas, be
created or deleted dynamically and yet guarantees that the state of all the replicas
eventually converges. Second, it is designed specifically to support many
small replicated objects, which is typical in web-based environments; in partic- 
ular, it has low space and computation overhead and handles object deletion 
efficiently. Third, the algorithm tolerates most failure types, including multiple 
node failures and sudden node retirements, and network partitions. Finally, it 
is non-blocking and decentralized, that is, it lets any node issue an update, and 
it makes no single node permanently responsible for maintaining replica consis- 
tency. We achieve this efficiency and versatility by being optimistic, that is, we 
guarantee that replicas become consistent only eventually. This optimism pre- 
cludes the use of our scheme in applications that demand high data reliability, 
such as banking, but it poses little problem in typical Internet services, because 
these services have simple update semantics and weak consistency requirements. 

1.1 Background

Our work derives from the Porcupine cluster-based mail server project [19]. Por- 
cupine connects up to several hundred off-the-shelf computers to serve billions of 
mail messages per day. Porcupine’s architecture is fully dynamic; any node can 

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 297-314, 2000. 
potentially manage any user’s profile and store any user’s email messages. For
each incoming message, Porcupine chooses a set of nodes on which to replicate
and store the message (called a replica set), based on node load and message
affinity. A cluster membership service and a distributed naming service keep
track of the locations of users’ messages. The dynamic message placement yields
many benefits: performance improvement via balanced load and flexible support
for system configuration changes by message migration.

From the viewpoint of a replication service, Porcupine presents an environment
very different from traditional database systems. Following are the key
characteristics of Porcupine and a discussion of how they define the goals of our
replication service:



Frequent failures: With hundreds of nodes in the cluster, a part of the system
is always down. First, the algorithm must provide strong fault tolerance
by maintaining replica consistency even when some of the nodes are down,
sometimes permanently. Second, the algorithm must be non-blocking, that
is, it must allow reads and writes to any replica any time regardless of whether
peer replicas are reachable.

Changing replica sets: Porcupine needs to change email message replica sets
automatically to react to node additions and removals. We must support dynamic
addition and removal of replicas while allowing contents updates
to the object.

Small object size: The unit of replication in Porcupine is an email message
whose average size is 5K bytes, as opposed to many gigabytes typical in
database systems. We need to minimize the space overhead of per-object
data structures used to maintain replica consistency.

Selective replication: Porcupine stores billions of email messages, each of
which is in its own replica set. Our algorithm needs to be quiescent, that is,
it should incur no computational and space overhead when no update is in
progress. Moreover, email messages are deleted frequently. Thus, our system
should support quick object deletion without leaving any data structures
behind.

Weak consistency requirements: Services such as email do not demand
strict replica consistency because the possibility of inconsistent data is inherent
in the environment; for example, unreliable network transport can
cause delivery delay or duplicate messages. Thus, we only need to support
eventual consistency of replica state.



These service requirements are not unique to email; in fact, they are shared
by many other Internet applications, including Usenet [20], Internet-based BBS
services (e.g., slashdot.org or delphi.com), naming services [15, 13], and wide-area
mirroring of Web or FTP data [17]. All these services are potential applications
of our replication algorithm.
1.2 Related Work

Data replication has been studied and deployed widely. However, previous work
in this area has addressed the goals of our algorithm only in a piecemeal fashion
and has not solved all of the problems that we face in our intended environment.

Traditional replication algorithms include primary-copy algorithms used 
widely in commercial database systems [3], quorum consensus algorithms [7, 
9, 10], and atomic broadcast protocols [4, 1]. They try to achieve single-copy se- 
mantics, that is, they give users an illusion of having a single, highly available 
copy of an object. Although generic, these algorithms fail to address problems we 
face - frequent failures and frequent changes - because they sacrifice availability 
by prohibiting accesses to a replica when data is not provably up to date. 

Mobile replicated database systems (e.g., Bayou [16] and Roam [18]) share
some of our goals: elimination of a single point of failure, handling frequent fail- 
ures, and dynamic replica addition and deletion. The main difference between 
their solutions and ours is that these systems focus on minimizing the commu- 
nication overhead, whereas we focus on minimizing the space and the compu- 
tation overhead. For example, the techniques used by these systems, including 
on-demand polling and a semantic log for describing updates, reduce the com- 
munication cost but increase both the computation and the storage overhead. 

Our algorithm is most closely related to multi-master wide-area replicated 
services, including Active Directory [13], Xerox Clearinghouse [6], and Usenet 
[20]. These systems provide non-blocking accesses, support replication of many 
small objects, and propagate updates efficiently over unreliable links. However, 
they do not support replica set changes and provide only a weak fault tolerance; 
for example, one failed node can stall the update propagation of the entire sys- 
tem. Moreover, the existing systems do not support quick object deletion and 
require storing update records (often called “death certificates”) for an indefi- 
nitely long period. 

Several systems allow dynamically changing the placement of replicas us- 
ing reference-monitoring mechanisms to balance the system load [17,23]. They 
update replicas by gossiping changes along a spanning tree and are unable to 
achieve replica consistency even under a single node failure. Our work comple- 
ments them by proposing a robust mechanism that can tolerate a wider variety 
of failures. 

1.3 Overview of the Algorithm 

Our algorithm is based on three principles: state transfer, update resolution
using the Thomas write rule, and update retirement using synchronized clocks.

In its basic form, our algorithm is similar to systems such as Active Directory
[13] and Usenet [20]. Any replica (or any node for a newly created object)
can issue an update any time. A coordinator, usually the issuer of the update,
propagates the update by pushing the new object state to others in the background.
Conflicting updates are resolved by the Thomas write rule [21], that is, by attaching
timestamps to them and accepting only the newest update.
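The conflict resolution just described can be illustrated with a short Python sketch. It assumes a timestamp is a (wall-clock time, node ID) pair compared lexicographically, which matches the Timestamp record of Fig. 1; the names `Timestamp` and `resolve` are hypothetical and are not part of the paper's listings.

```python
from typing import NamedTuple, Optional


class Timestamp(NamedTuple):
    time: float  # wall-clock time at which the update was issued
    nid: str     # issuing node's ID, used only to break ties


def resolve(current: Optional[Timestamp], incoming: Timestamp) -> Timestamp:
    """Thomas write rule: keep whichever update carries the newer timestamp."""
    if current is not None and current > incoming:
        return current   # the incoming update is stale and is discarded
    return incoming      # the incoming update is at least as new and wins


a = Timestamp(time=100.0, nid="A")
b = Timestamp(time=100.0, nid="B")   # same clock value, larger node ID
assert resolve(a, b) == b
assert resolve(b, a) == b            # the order of arrival does not matter
```

Because the comparison is a total order, every node that sees both updates settles on the same winner regardless of delivery order.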
Our algorithm is unique in its uniform handling of replica set updates and
contents updates; in fact, an update is actually a tuple consisting of the new
object contents and the new replica set. For an update that changes the replica 
set, the coordinator pushes the update to the union of the old and the new replica 
sets (called the targets of the update). A node receiving the update either modifies,
creates, or deletes a replica depending on whether or not it appears in
the new replica set. The Thomas write rule is again used to resolve conflicts among
replica set changes; the older updates are canceled by forwarding the newest 
update to their targets and letting them be rolled back. Overriding the older 
updates requires the coordinator of the newer update to discover the older up- 
dates’ targets. This node discovery problem is similar to the distributed resource 
discovery problem [8], and the solution is also similar: the nodes that receive the 
update send back to the coordinator the sets of nodes they know and let the 
coordinator expand its target node set transitively. 
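The transitive expansion of the target set can be sketched as follows. This is a simplified, failure-free illustration: the `knows` map stands in for the node sets that real replies would carry, and the function name `discover_targets` is an assumption made for this example.

```python
from typing import Dict, Set


def discover_targets(start: Set[str], knows: Dict[str, Set[str]]) -> Set[str]:
    """Expand a target set transitively using the node sets returned in replies."""
    target = set(start)
    done: Set[str] = set()
    while done != target:
        node = min(target - done)            # pick any target not yet contacted
        target |= knows.get(node, set())     # the reply widens the target set
        done.add(node)
    return target


# Hypothetical knowledge: A and B each added a replica the other has not heard of.
knows = {"A": {"A", "B", "C"}, "B": {"A", "B", "D"}, "C": {"A", "C"}, "D": {"B", "D"}}
assert discover_targets({"A", "B", "C"}, knows) == {"A", "B", "C", "D"}
```

The loop stops exactly when every known target has replied, which mirrors the termination condition done = target used by the algorithm.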

We apply the at-most-once messaging technique using synchronized clocks [12]
to retire updates. After the coordinator completes update propagation, it sends 
out retirement notices to the target nodes. Upon receiving a retirement notice, 
a node deletes the update from disk after waiting for a fixed period; the wait 
ensures that the node will never apply a stale update in the future. 

This combination of state transfer, the Thomas write rule, and quick update
retirement allows our algorithm to meet our goals effectively as follows:

— Our algorithm’s basic design directly achieves the three goals — non-blocking 
access, dynamic replica set changes, and eventual consistency. 

— We achieve strong fault tolerance in two ways. First, our algorithm is fully 
decentralized — in particular, it lets any node take over the task of the 
coordinator any time. Second, its node discovery process eliminates a single 
point of failure quickly and lets the system tolerate a sudden node retirement 
without compromising replica consistency. 

— Our algorithm’s space overhead is quite small for two reasons. First, the state 
transfer architecture minimizes the size of the update record by omitting the 
new object contents - the object contents are usually read from the replica 
directly. Second, our algorithm quickly reclaims the space occupied by update 
records by retiring them as soon as they finish propagation. 

1.4 Structure of the Paper 

The rest of the paper is structured as follows. We describe our algorithm in de- 
tail in Section 2 and show two examples in Section 3 to elucidate its behavior, 
in particular, the resolution of concurrent updates. Section 4 proves the correct- 
ness of our algorithm. In Section 5, we discuss several extensions to the basic 
algorithm to address issues that arise in practice: e.g., optimizations to make 
the algorithm work efficiently and the handling of long-term failures. We briefly
discuss the computational and the space overhead of the algorithm in Section 6 
and conclude in Section 7. 
2 The Replication Algorithm

Although our algorithm is designed to support many objects replicated on a 
diverse set of nodes, it is first presented in the context of a single object. We 
describe a straightforward extension of the basic algorithm to support multiple 
objects in Section 5.1. 

2.1 System Model and Assumptions 

We make the following assumptions about the environment. First, the nodes 
in the system can communicate through a fully connected network. Second, 
nodes and network links may crash, and messages may be reordered, delayed, 
or lost, but Byzantine failures will not occur. Finally, the nodes have loosely 
synchronized clocks; clock synchronization algorithms are well known and are 
deployed widely [14]. 

Our algorithm only propagates updates. Locating and reading replicas and 
choosing the replica placement are outside the scope of this algorithm. For locat- 
ing replicas, any weakly consistent naming service can be used. For replication 
policy, an object could be assigned to a random set of nodes [19] or to a set 
determined by reference-monitoring mechanisms [17,23]. 

2.2 Notational Conventions 

All global variables are stored on stable storage and survive node crashes. Procedures
marked public run as transactions that are non-preemptive and update
global variables atomically. (val1, val2) is a tuple of two values. Send(node, proc,
args...) sends a message to node and requests calling proc with args. Send
does not wait for proc to finish; it merely queues the message. Text after ‘|’
is a comment.

2.3 Data Structures

Fig. 1 shows the types used in the algorithm. Loosely synchronized clocks order
updates [14]. Other types of clocks, e.g., logical clocks [11], may be used without
affecting the correctness of the algorithm, but wall clocks best suit our purpose
because they can order logically unrelated events (e.g., a user contacting two
nodes in a cluster serially). The procedure Now() returns the current local clock
value. We assume the clock resolution to be fine enough that successive calls to
Now() always return different values; this assumption lets us use timestamps to
identify updates.



type Timestamp = record
  time: Clock | wall-clock time.
  nid: NodeID | a tie-breaker.
end

type Update = record
  state: {ACTIVE, RETIRING, RETIRED, SUSPENDED}
  ts: Timestamp
  target, done, peer: NodeSet
end



Fig. 1. Data structures used by the algorithm. 
An update to the object is represented by the Update record. Its state field
indicates the update’s status. A new update starts as being ACTIVE. An update
is RETIRING on the coordinator while retirement notifications are sent, and it is
RETIRED after the reception of a retirement notice. An update is SUSPENDED
when it is found to conflict with a newer update, and it stays dormant until the
newer update arrives and supersedes it. Target, done, and peer fields specify the 
set of nodes that should receive the update, have acknowledged the update, and 
should replicate the object, respectively. Thus, done and peer are always subsets 
of target. The update propagation finishes when done = target. 

Five persistent variables are stored per replica on a node (Fig. 2). Two of them,
gData and gPeer, are visible to the application and the rest are used internally by
the replication algorithm. GData stores the actual contents of the object; we are
not concerned about the object’s internal structure in this paper. GPeer shows, to
the best of the node’s knowledge, the replica set of the object. GU remembers the
newest update applied on the object. Notice the absence of the new object contents
in gU: the contents are propagated to other nodes by reading from gData
directly most of the time. The exception is when gData is deleted by an update,
but the object contents still need to be propagated to other nodes (this happens,
for example, when the object is moved from one node to another). GSavedData is
used to save the new object contents in such cases, and it is otherwise null. That
gSavedData is usually null contributes to reducing the space overhead of the
algorithm, because all other data structures used by the algorithm are of small
and fixed size. GRetireTime is used to delete a retired update and is discussed
further in Section 2.7.

2.4 Application Programming Interface 

One procedure, UpdateObject (Fig. 3), is called by the application. It takes two
parameters, the new replica set (peer) and the new object contents (data). Passing
an empty set to peer will delete the object entirely from the system. The caller of
this procedure must ensure that the node stores a replica already, except when the
object is being newly created. This restriction, also discussed in Section 4.2, is
to prevent creating an orphan replica that is disconnected from others and is not
found by the node discovery process. The implicit variable me shows the name
of the node itself.

public proc UpdateObject(peer, data)
  u ← Update(ts ← Timestamp(Now(), me),
             state ← ACTIVE, done ← ∅,
             peer ← peer, target ← peer)
  ApplyUpdate(u, data)

Fig. 3. UpdateObject is called by the application to create a new object, modify
the object contents, add or remove the replica set, or delete the object.

var
  gData: Contents ← NULL
  gPeer: NodeSet ← ∅
  gU: Update ← NULL
  gRetireTime: Clock
  gSavedData: Contents

Fig. 2. Per-node, per-object global variables used by the algorithm.
2.5 Update Application



proc ApplyUpdate(u, data): bool
  | Expand knowledge. (1)
  u.target ← u.target ∪ gPeer
  if gU ≠ NULL
     ∧ gU.state ∉ {RETIRING, RETIRED}
    u.target ← u.target ∪ gU.target
    gU.target ← u.target
  | Reject if u is stale. (2)
  if gU ≠ NULL ∧ gU.ts > u.ts
    return false
  | Log the update. (3)
  gU ← u
  gSavedData ← NULL
  | Modify the replica.
  if me ∈ gU.peer
    (gData, gPeer) ← (data, gU.peer)
  else
    (gData, gPeer) ← (NULL, ∅)
    if gU.peer ≠ ∅
      | Save data; not needed when peer is
      | ∅ since everyone deletes the replica.
      gSavedData ← data
  gU.done ← gU.done ∪ {me}
  return true



Fig. 4. Local update application. This procedure logs and applies the update to the 
local replica. It returns whether the update was successfully applied or not. 



The procedure ApplyUpdate (Fig. 4), called both from the local application
and from remote nodes, logs and applies the update to the replica and prepares
for update propagation. It first merges the target sets of both u and the current
update, gU (1). This must be done even when u is to be discarded (2) so that
the participants of both updates can eventually receive the newer update.
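The behavior of ApplyUpdate can be made concrete with a Python rendering of Fig. 4. This is a sketch rather than the authors' implementation: the `Replica` and `Update` classes, the tuple encoding of timestamps, and the in-memory fields standing in for stable storage are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

ACTIVE, RETIRING, RETIRED, SUSPENDED = "ACTIVE", "RETIRING", "RETIRED", "SUSPENDED"


@dataclass
class Update:
    ts: Tuple[float, str]                 # (wall-clock time, node ID)
    peer: Set[str]                        # nodes that should replicate the object
    target: Set[str]                      # nodes that should receive the update
    done: Set[str] = field(default_factory=set)
    state: str = ACTIVE


@dataclass
class Replica:
    me: str
    data: Optional[str] = None                     # gData
    peer: Set[str] = field(default_factory=set)    # gPeer
    u: Optional[Update] = None                     # gU
    saved: Optional[str] = None                    # gSavedData

    def apply_update(self, u: Update, data: Optional[str]) -> bool:
        # (1) Expand knowledge: merge target sets even if u is rejected below.
        u.target |= self.peer
        if self.u is not None and self.u.state not in (RETIRING, RETIRED):
            u.target |= self.u.target
            self.u.target = set(u.target)
        # (2) Reject a stale update.
        if self.u is not None and self.u.ts > u.ts:
            return False
        # (3) Log the update, then modify the local replica.
        self.u, self.saved = u, None
        if self.me in u.peer:
            self.data, self.peer = data, set(u.peer)
        else:
            self.data, self.peer = None, set()
            if u.peer:
                self.saved = data         # keep contents for further propagation
        u.done |= {self.me}
        return True


b = Replica(me="B", data="v0", peer={"A", "B"})
u_b = Update(ts=(2.0, "B"), peer={"A", "B", "D"}, target={"A", "B", "D"})
u_a = Update(ts=(1.0, "A"), peer={"A", "B", "C"}, target={"A", "B", "C"}, done={"A"})
assert b.apply_update(u_b, "v1")              # B issues its own, newer update
assert not b.apply_update(u_a, "v1-from-A")   # A's older update is rejected,
assert b.u.target == {"A", "B", "C", "D"}     # yet B has now learned about C
```

The last assertion shows why the merge in step (1) precedes the staleness check: even a rejected update contributes the nodes it names to the surviving update's target set.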

2.6 Update Propagation 

An update is pushed to other nodes periodically by PushUpdate (Fig. 5). The
target node set expands as replies come back from the target nodes (6). The
propagation finishes when all the target nodes reply (4).

The function IAmCoordinator tells whether the node is designated to coordinate
a particular update. For now, it just returns true, meaning that any node
can be a coordinator, and that an update is flooded among all the target nodes.
Having multiple coordinators does not affect the correctness of the algorithm,
but it wastes network bandwidth; we improve this design in Section
5.2.

2.7 Deleting Retired Updates 

While each update record occupies only a small space, it is stored even when 
the replica itself is removed. We need to delete updates in a timely manner; 
otherwise, the update records of deleted replicas will accumulate and eventually 
fill the disk up. An update is deleted from disk in two steps. The first step, 
performed periodically by PushRetire (Fig. 6), is the update retirement; the 
coordinator informs the target nodes that update propagation is complete. The 
second step is the update removal in which we apply the at-most-once messaging 

1 Markers such as (1) refer to lines in the program listings.
public proc PushUpdate()
  if gU ≠ NULL ∧ gU.state = ACTIVE
    if gU.done = gU.target | Update done (4)
      gU.state ← RETIRING
      gU.done ← {me}
      gSavedData ← NULL
    elif IAmCoordinator(gU)
      foreach node ∈ (gU.target − gU.done)
        if node ∉ gU.peer
          | Node will delete the replica.
          data ← NULL
        elif gData ≠ NULL
          | I store valid contents.
          data ← gData
        else
          data ← gSavedData
        Send(node, UpdateRequest,
             me, gU, data)

public proc UpdateRequest(caller, u, data)
  ok ← ApplyUpdate(u, data)
  Send(caller, UpdateReply,
       u.ts, ok, gU.done, gU.target)

public proc UpdateReply(ts, ok,
                        done, target)
  if ¬UpdateOverwritten(ts) | (5)
    if ¬ok
      gU.state ← SUSPENDED
      return
    gU.done ← gU.done ∪ done
    gU.target ← gU.target ∪ target | (6)

proc UpdateOverwritten(ts): bool
  return gU = NULL | The update retired
         ∨ gU.ts > ts | A new update arrived

proc IAmCoordinator(u): bool
  return true



Fig. 5. Update propagation. PushUpdate is called periodically to push the newest 
update to participants. UpdateRequest is executed on remote nodes in response to 
PushUpdate. UpdateReply is called on the coordinator to handle replies from Up- 
dateRequest. 



algorithm [12] to remove retired updates without being confused by out-of-order 
update messages. Here, the node simply waits for MAXDELAY seconds before 
deleting a retired update (Fig. 6, RemoveUpdate). MAXDELAY is the sum of 
the maximum clock skew among nodes and the message lifetime, an interval 
long enough that almost all the messages will arrive at the destination within 
the interval. 

This update removal scheme additionally requires each node to discard stale 
incoming network messages (Fig. 6, MessageArrived). Here, each network message
is stamped with the sender’s clock value and is accepted by the receiver only 
when its timestamp is no older than MAXDELAY on the receiver’s clock. 
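The two time-based checks described above can be sketched in Python. The numeric value of `MAXDELAY` below is an assumed placeholder for illustration; the paper defines MAXDELAY only as the sum of the maximum clock skew and the message lifetime. The function names are likewise hypothetical.

```python
MAXDELAY = 30.0  # seconds; an assumed value, chosen only for this example


def message_is_fresh(msg_ts: float, now: float, max_delay: float = MAXDELAY) -> bool:
    """Accept a message only if its sender timestamp is no older than max_delay."""
    return msg_ts >= now - max_delay


def can_remove_update(retire_time: float, now: float, max_delay: float = MAXDELAY) -> bool:
    """A retired update may be deleted once max_delay has passed since retirement."""
    return now > retire_time + max_delay


assert message_is_fresh(msg_ts=100.0, now=120.0)
assert not message_is_fresh(msg_ts=100.0, now=131.0)
assert not can_remove_update(retire_time=100.0, now=130.0)
assert can_remove_update(retire_time=100.0, now=130.5)
```

The two checks work as a pair: the receiver drops any message older than MAXDELAY, and a retired update record is kept for the same interval before it is deleted.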

3 Examples 

We show two examples to illustrate the behavior of the algorithm. The first 
example is a simple contents update. The second example demonstrates the 
node discovery process used to resolve confiicting replica set changes. 

Fig. 7 shows a sequence of steps performed to update an object replicated 
on nodes A, B, and C. 

(1) A issues an update U:(target=peer={A,B,C}) and modifies its replica.

(2) A pushes U to B and C.

(3) B applies U and returns (U.ts, true, {A,B}, {A,B,C}) to A. C applies U and
returns (U.ts, true, {A,C}, {A,B,C}) to A.
public proc PushRetire()
  if gU ≠ NULL ∧ gU.state = RETIRING
    if gU.done = gU.target
      gRetireTime ← Now()
      gU.state ← RETIRED
    elif IAmCoordinator(gU)
      foreach node ∈ (gU.target − gU.done)
        Send(node, RetireRequest, me, gU.ts)

public proc RetireRequest(caller, ts)
  if ¬UpdateOverwritten(ts)
    gRetireTime ← Now()
    gU.state ← RETIRED
    gSavedData ← NULL
  Send(caller, RetireReply, me, ts)

public proc RetireReply(node, ts)
  if ¬UpdateOverwritten(ts)
    gU.done ← gU.done ∪ {node}

public proc RemoveUpdate()
  if gU ≠ NULL ∧ gU.state = RETIRED
     ∧ Now() > gRetireTime + MAXDELAY
    gU ← NULL | Delete the update. (7)

public proc MessageArrived(msg)
  if msg.ts < CurClock() − MAXDELAY
    return | message too old. just ignore it
  dispatch msg



Fig. 6. Update retirement. PushRetire is called periodically to push retirement notices
to participants. RetireRequest is executed on remote nodes in response to PushRetire.
RetireReply is called on the coordinator to handle replies from RetireRequest.
RemoveUpdate is called periodically to remove retired updates. MessageArrived is called
for every incoming message to discard messages that are too old.

Fig. 7. An object replicated on nodes A, B, and C is updated. Gray circles indicate
nodes that have applied the update. The letter “U” indicates that the update is
logged on the node.




(4) A receives the replies from B and C and changes U.state to RETIRING. A
sends U’s retirement to B and C.

(5) B and C change U.state to RETIRED and reply to A. A receives the replies
from B and C and changes U.state to RETIRED.

(6) MAXDELAY seconds later, A, B, and C erase U from gU.

Fig. 8 shows a scenario in which an object replicated on nodes A and B is
updated concurrently, first by A that adds C to the replica set, and next by B
that adds D to the replica set. We assume that B’s update is newer than A’s.

(1) A issues U_A:(target=peer={A,B,C}) and modifies its gPeer. Simultaneously,
B issues U_B:(target=peer={A,B,D}) and modifies its gPeer.

(2) A pushes U_A to B and C. B pushes U_B to A and D. Now, for the sake of
explanation, suppose C and D receive the updates before B and A do.

(3) C creates a replica and replies (U_A.ts, true, {A,C}, {A,B,C}) to A. D creates
a replica and replies (U_B.ts, true, {B,D}, {A,B,D}) to B.

(4) B receives U_A from A. Because U_A.ts < U_B.ts, B rejects U_A and replies (U_A.ts,
false, {A,B}, {A,B,C,D}) to A. B’s U_B.target becomes {A,B,C,D} (Fig. 4 (1)).
On receiving the reply from B, A changes U_A.state to SUSPENDED.
Fig. 8. Conflicting updates involving replica addition. Gray circles are nodes that
applied and diagonally shaded circles are nodes that applied . 



(5) A receives U_B from B. Because U_B.ts > U_A.ts, A modifies gPeer, replaces
gU with U_B, and replies (U_B.ts, true, {A,B}, {A,B,C,D}) to B. Later, A may
receive a reply for U_A from C, but A ignores the reply (Fig. 5 (5)).

(6) B receives the reply from A and pushes U_B to C, the only target not yet
contacted. On reception of U_B, C recognizes that it is not a part of the new
replica set, removes its replica, and replies (U_B.ts, true, {A,B,C}, {A,B,C,D})
to B.

(7) B receives the replies from C and D. B changes U_B.state to RETIRING and
pushes retirement to A, C, and D. A, C, and D acknowledge the retirement.

(8) MAXDELAY seconds later, A, B, C, and D erase U_B from gU. In the end,
A, B, and D store replicas. C created a replica and later deleted it.

4 Correctness Proof 

While being simple, our algorithm contains several subtleties, especially regard- 
ing replica additions and deletions. For example, how does it guarantee that all 
replicas receive an update, when another update is adding replicas concurrently? 
In this section, we prove two main safety properties of the algorithm: all nodes 
receive the newest update at the end of propagation, and no stale updates are 
accepted by nodes regardless of concurrent updates. We also argue the liveness 
of the algorithm, i.e., all the replicas will receive the newest update, by applying 
our safety arguments. 

The state of the system can be viewed as a directed graph, called the knowledge
graph, in which the vertices represent nodes and the edges represent the
nodes’ knowledge of others through gPeer and gU.target. The graph is usually
complete (i.e., no update is issued and the value of gPeer is identical on all
Symbol             Meaning

n:var              The value of variable var on node n.
n:U.target_T       The value of U.target on node n just before T.
n:U.done_T         The value of U.done on node n just before T.
n1 → n2 (in G_T)   n1 knows n2 in G_T, i.e., (n1 → n2) ∈ E_GT.
n1 ⇝ n2 (in G_T)   A path from n1 to n2 exists, i.e., n1 → n2 ∨ (n1 → n3 ∧ n3 ⇝ n2) for some n3.

Table 1. Notational conventions used in the proof



the replicas), but it becomes incomplete during replica addition or deletion. At a 
high level, our proof shows that the algorithm ripples the newest update through 
the graph, adding edges to the graph along the way to cover all the nodes and 
to restore the completeness of the graph eventually. 

Following are notations used in the ensuing proof. Other symbols are sum- 
marized in Table 1. 

— A node “stores a replica” when gData ≠ NULL.

— A node “has an update” when its gU.state is ACTIVE or RETIRING.

— A node “retires an update” when it sets gU.state to RETIRED.

— A node n1 “knows” another node n2 when either n1 has an update and n2 ∈
n1:gU.target, or n2 ∈ n1:gPeer.

— G_T = (V_GT, E_GT), a knowledge graph for the object at time T, is defined as
follows.

V_GT = Nodes that store a replica or have an update.

E_GT = {v1 → v2 | {v1, v2} ⊆ V_GT ∧ v1 knows v2}

— S_T = (V_ST, E_ST), an induced subgraph of G_T, excludes from G_T vertices that
correspond to failed nodes and the associated edges. S_T shows the knowledge
graph in the presence of failure.

4.1 Correctness Criteria 

Ideally, we want to prove that the algorithm keeps all the live replicas consistent 
regardless of types of failures. Such a guarantee, however, is impossible when 
nodes or links fail in a way that makes the corresponding induced subgraph 
disconnected. For example, suppose two nodes fail simultaneously after both
have created two new replicas, as illustrated in Fig. 9. After such a failure, any
update issued on the two new replicas will not reach each other, and the new 
replicas will remain inconsistent until the original two nodes recover. Therefore, 
we define the correctness only under the condition that a knowledge subgraph 
is at least weakly connected. 
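The weak-connectivity condition can be checked mechanically on a small knowledge graph. The sketch below is illustrative only: the function `weakly_connected` and the edge sets, which encode one reading of the Fig. 9 scenario (each new replica knows both original nodes), are assumptions made for this example.

```python
from typing import Dict, Set


def weakly_connected(edges: Dict[str, Set[str]], live: Set[str]) -> bool:
    """Check weak connectivity of the subgraph induced by the live nodes."""
    if not live:
        return True
    # Ignore edge direction and drop edges that touch failed nodes.
    undirected = {n: set() for n in live}
    for src, dsts in edges.items():
        for dst in dsts:
            if src in live and dst in live:
                undirected[src].add(dst)
                undirected[dst].add(src)
    seen, stack = set(), [next(iter(live))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(undirected[n] - seen)
    return seen == live


# Assumed knowledge edges just before the crashes in the Fig. 9 scenario.
edges = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "B"}, "D": {"A", "B"}}
assert weakly_connected(edges, {"A", "B", "C", "D"})
assert not weakly_connected(edges, {"C", "D"})   # A and B have crashed
```

With A and B removed, C and D share no edge, so no sequence of messages between live nodes can bring their replicas together until A or B recovers.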



^ All times mentioned in the proof are hypothetical global times observed by an ex- 
ternal agent. 
Fig. 9. A scenario that disconnects a graph. The edges show knowledge among nodes.
(1) Replicas A and B initially know each other. (2) A issues U1 and creates a replica
C. (3) B issues U2 and creates a replica D. (4) A crashes before U1 is propagated to B
or D. B crashes before U2 is propagated to A or C. In the end, two components, {C}
and {D}, are both live but disconnected.



Correctness criteria: Suppose S_{T_s} is weakly connected, and no node
or link fails and no new update is issued during a long enough period
(T_s, T_e). Let U be the newest update generated before T_e. The algorithm
is correct if the following conditions hold.

(1) Every node n ∈ U.peer applies U before T_e.

(2) No node n ∉ U.peer stores a replica at T_e.

(3) No update older than U (i.e., any U′ with U′.ts < U.ts) is applied on any node
after U is applied.

Notice that these criteria demand, in case of a graph disconnection, a replica 
consistency within each partition, and that as soon as the partitions re-integrate, 
all the replicas converge onto the globally newest state. Indeed, this set of criteria 
is as strong as any non-blocking replication algorithm can guarantee. 

4.2 Graph Invariants

Theorem 1. If S_T is strongly connected, then ∀T′ > T, S_{T′} is also strongly
connected if no node or link fails during the period (T, T′).

Theorem 2. If S_T is weakly connected, then ∀T′ > T, S_{T′} is weakly connected
if no node or link fails during the period (T, T′).

Proof sketch. We show by induction that no transition on the subgraph can 
disconnect the subgraph. First, a replica creation will not disconnect the graph 
because a new replica is always created by “stemming out” from an existing
replica (Section 2.4). Next, replica deletion does not change the graph shape 
because the update record is still stored on the same node. Finally, update 
retirement will not disconnect the graph because an update retires only after 
all the peer replicas have spanned the edges to one another. Theorem 2 can be 
proved exactly the same way. ■ 

Theorem 3. G_T is strongly connected for all T.

Proof. The graph is clearly connected in the base case in which no replica exists.
Thus, G_T is connected for all T from Theorem 1. ■

4.3 All Replicas Receive the Newest Update

Now, we prove that all the nodes receive the newest update by distinguishing 
two cases. Theorem 4 proves that if an update retires, all the nodes must have 
received the update. Theorem 5 proves that when an update is unable to retire 
because some of its targets are dead, all the remaining nodes still receive the 
update. These two theorems together prove the correctness criteria (1) and (2). 

Theorem 4. Suppose S_{T_s} is weakly connected, and no node or link fails and no
new update is issued during a long enough period (T_s, T_e). Let U be the newest
update generated before T_e. If a coordinator c begins the retirement of U at time
T (T < T_e), then V_{S_T} ⊆ c:U.done_T; that is, all the nodes in V_{S_T} have received
and applied U.

Proof. We only need to prove that Q c:lIdone,T; if C all done, t, then 
C Vsjr because no newer update is generated after T and no node possesses 
a stale update after T. 

For the sake of contradiction, suppose V_{S_T} ⊈ c:U.done_T.

Because S_T is weakly connected from Theorem 2, we can pick a node p ∈
c:U.done_T such that ∃n ∈ V_{S_T} − c:U.done_T and that either n → p or p → n. Below,
we show that no such pair of p and n can exist.

First, suppose a pair (p, n) with an edge p → n exists (Fig. 10 (a)). Let T_p
be the time c propagated U to p (T_p < T). First, the edge p → n must have
been created at or before T_p, because U is the newest update and any other
update that could have created p → n would have been rejected by p after T_p.
On the other hand, p → n must have been created after T_p; otherwise, the edge
c → n must be in the graph (Fig. 5 (6)). Because of this contradiction, this pair
(p, n) cannot exist. Second, suppose only a pair (p, n′) with an edge n′ → p exists
(Fig. 10 (b)). From Theorem 3, a path c ⇝ n′ exists in the full graph G_T
(remember, G_T may include dead nodes). Therefore, there exists a dead or
uncommunicative node q ∈ c:U.target_T along the path c ⇝ n′, and q makes U
unable to retire in the first place. Therefore, this pair of nodes (p, n′) cannot
exist as well. Therefore, for U to retire, V_{S_T} ⊆ c:U.done_T. ■

Fig. 10. A coordinator c may fail to contact a node n in two situations.

Theorem 5. Suppose S_{T_s} is weakly connected, and no node or link fails and no
new update is issued during a long enough period (T_s, T_e). Let U be the newest
update generated before T_e. If (c:U.target_{T_e} − c:U.done_{T_e}) ∩ V_{S_{T_e}} = ∅ on a
coordinator node c, then V_{S_{T_e}} ⊆ c:U.done_{T_e}.

Proof. For the sake of contradiction, suppose V_{S_{T_e}} ⊈ c:U.done_{T_e}. Using the
argument that appeared in the previous proof, we can pick a node p ∈ c:U.done_{T_e}
such that ∃n ∈ V_{S_{T_e}} − c:U.done_{T_e} and either n → p or p → n, and we can show
that a pair (p, n) with an edge p → n cannot exist.

Now, suppose only a pair (p, n′) with an edge n′ → p exists. For the moment,
let's assume the following lemma holds for any pair of nodes, n1 and n2.

Lemma 1. If n1 → n2 but not n2 → n1, then n1 has an update.

From this lemma, n′ has an update, say, U′. Thus, n′ contacts p, which in
turn causes n′ to discover c (Fig. 5 (6)), which in turn causes n′ to propagate U′
to c, thereby letting c discover n′ before T_e (Fig. 4 (1)). Therefore, a pair (p, n′)
with an edge n′ → p cannot exist as well. Thus, V_{S_{T_e}} ⊆ c:U.done_{T_e}. ■

We sketch the proof of Lemma 1. Edges can disappear only when an update
retires. For an update to retire, all the target nodes need to reply, that is, edges
must span between any pair of the target nodes. Thus, when n1 → n2 but not
n2 → n1, then n1 must have an update. ■

4.4 No Node Receives a Stale Update 

Theorem 6. Suppose no update is generated for a long period ending at T_e. Let
U be the newest update issued before T_e. After U retires, no update older than
U is applied on any node.

Proof. A node may receive an older update after it retires U for two potential 
reasons: (1) another node that has not received U propagates a stale update, or 
(2) a network delay causes a message containing a stale update to be received 
after U retires. Theorem 4 prevents the case (1). The use of synchronized clocks, 
described in Section 2.7, prevents the case (2) (refer to [12] for the full proof). 

4.5 Liveness 

Liveness is derived immediately from the work of the algorithm. The algorithm 
sends update or retire messages to the nodes in gU.target until it receives the
replies from all the nodes. gU.target may grow as a result of reply processing
(Fig. 4 (1)), but because its size is finite, the coordinator will eventually push 
the update to all the nodes it can communicate with^ . 

5 Extensions 

5.1 Supporting Multiple Objects 

All the discussions so far have focused on a single object, but in fact, the basic 
algorithm can be extended easily to support multiple objects. To support multiple
objects, instead of the variables gU, gSavedData, and gRetireTime, we now
have a persistent table that partially maps an object ID to an update in progress 
for the object. An update is added to the table when an object is going to be 
modified (Fig. 4 (3)), and is deleted from the table when it is removed (Fig. 6 
(7)) or is superseded by a newer update for the same object (Fig. 4 (3)). 

^ Here, we assume a bounded message transmission delay. Otherwise, no algorithm 
can ensure liveness. 

5.2 Designated Coordinator

The basic algorithm presented so far is inefficient because it floods an
update among all the target nodes in PushUpdate, causing an update to be
sent as many as (N − 1)^2 times (N is the number of replicas). By combining
the use of a group membership service [5, 22] and a simple change to the
function IAmCoordinator (Figures 5 and 11), however, we can reduce the cost to
N − 1 in the common case. In the new implementation, a node pushes or retires
an update only when it is the issuer of the update or when it is designated to
take over the failed issuer. Notice that because the membership service is shared
by all the objects hosted on a node, its cost is amortized over many runs of the
algorithm and becomes negligible.

    IAmCoordinator(u): bool
        return u.ts.nid = me
            ∨ min(members ∩ u.target) = me

Fig. 11. Designated coordinator selection. A membership service stores the set of
presumed live nodes in members.
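
As an illustration of the selection rule, the predicate of Fig. 11 can be written as a small function. This is only a sketch under stated assumptions: node IDs are assumed to be mutually comparable values (strings here), and the parameter names `me`, `issuer`, `targets`, and `members` are hypothetical stand-ins for the node's own ID, u.ts.nid, u.target, and the membership service's view.

```python
def i_am_coordinator(me, issuer, targets, members):
    """Return True if node `me` should push or retire the update.

    The node coordinates when it issued the update, or when it is the
    smallest-ID target that the membership service presumes to be alive.
    """
    if issuer == me:
        return True
    live_targets = set(members) & set(targets)
    # With no presumed-live target there is nobody to designate.
    return bool(live_targets) and min(live_targets) == me
```

For example, if node "a" issued an update with targets {"a", "b", "c"} and then drops out of the membership view, "b" takes over while "c" stays passive.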

5.3 Delaying Update Retirements 

The algorithm is further optimized by delaying and aggregating calls to PushRetire
for different objects to the same node. Delaying calls to PushRetire does not
affect the replica consistency; it merely delays the deletion of the update record 
and increases the size of the update table. 

5.4 Optimistic Deltas 

Instead of pushing the entire object state every time, we can send optimistic 
deltas [2] to save the network and the computational cost. Here, a coordinator 
simply pretends that all the replicas for the object were consistent before the 
update and pushes only the difference between the old and the new contents 
(called the optimistic delta) along with the fingerprint of the old replica contents.
On the receiver side, a node applies the update when its replica's fingerprint
matches the update's; otherwise, the node requests a full contents transfer from
the coordinator. This technique can reduce the cost of the algorithm, especially 
during replica set changes, in the common case without concurrent updates. 

A fingerprint is any short bit-string that summarizes the replica contents. Applying
a collision-resistant hash function (e.g., MD5) on the replica contents is
one way to compute a fingerprint. A faster, more accurate, but slightly more 
space-consuming alternative is to store along with each replica a timestamp 
(Figures 1 and 4) that shows the last time the replica was modified, and to use 
the timestamp as a fingerprint. 
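
The optimistic-delta exchange can be sketched as follows. This is an illustrative example rather than the implementation used in Porcupine: replica contents are assumed to be a dictionary of string fields, the delta format (`base`, `set`, `del`) is invented for the example, and SHA-256 from the Python standard library is substituted for the hash function mentioned above.

```python
import hashlib
import json


def fingerprint(contents):
    """Short summary of the replica contents (hash of a canonical encoding)."""
    data = json.dumps(contents, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def make_delta(old, new):
    """Coordinator side: difference between old and new contents, tagged
    with the fingerprint of the contents the delta assumes."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    removed = [k for k in old if k not in new]
    return {"base": fingerprint(old), "set": changed, "del": removed}


def apply_delta(replica, delta):
    """Receiver side: apply the delta if the fingerprints match.

    Returns the new contents, or None when the receiver must instead
    request a full contents transfer from the coordinator.
    """
    if fingerprint(replica) != delta["base"]:
        return None
    result = dict(replica)
    result.update(delta["set"])
    for key in delta["del"]:
        result.pop(key, None)
    return result
```

A receiver whose replica already diverged from the coordinator's old contents gets None and falls back to the full transfer, which is the behaviour the text describes for concurrent updates.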

5.5 Handling Long-term Failures 

In the real world, computers often crash and never recover. Such nodes create 
an unbounded amount of backlog of updates that eventually fill up the disks on 
other nodes. Our algorithm handles such a situation automatically by purging 
nodes that remain down for too long. 

When a node finds another node dead for more than a predefined purge period
(e.g., one week), it pretends that it received affirmative replies from the dead 
node for all its jammed updates. The node then purges the dead node simply by 
removing the dead node’s name from the replica sets of all the replicas stored on 
the node. To avoid having inconsistent data, when a node recovers after being 
down for the purge period or longer, it clears its disk contents and rejoins the 
cluster with a new node ID. 

The only remaining problem is when nodes or links fail in such a way as 
to make the knowledge graph disconnected, and they remain failed until the 
purge deadline. In such a case, the aforementioned scheme may make replicas
permanently inconsistent. We argue below that such a scenario is highly unlikely 
to happen in practice. 

How can a graph become disconnected? One cause of a graph disconnection 
is link failures (i.e., network partitioning). Another cause, which may happen 
without link failures, is multiple node failures combined with concurrent replica 
additions, illustrated in Fig. 9. Now, can a graph disconnection last until the 
purge period? The answer is no, for all practical purposes. First, a network 
partitioning would never last long because it is repaired simply by installing re- 
placement parts. Second, the latter failure scenario would not happen in practice 
because it requires a combination of simultaneous replica creations and coinci- 
dental sudden long-term failures of multiple nodes — the window of vulnerability 
is very narrow for both. 

6 Performance 

6.1 Networking and Computational Overhead 

With the optimizations described in Section 5.2, our algorithm pushes an update 
to N replicas in the common case and aggregates retirement notices into one 
batch notice. In total, the algorithm sends 2(1 + 1/G)N messages per update, where
G is the average aggregation factor for retirement notices. In Porcupine, the value
of G is around 20 under heavy load, and the networking and the processing costs
of our algorithm are close to 2N per update, which is the optimal number for an
algorithm that does not batch (and thus delay) update propagation: N + N
messages are always needed to propagate and acknowledge an update to N nodes.

6.2 Space Overhead 

This algorithm stores two types of data structures per replica in addition to the 
contents: the replica set (gPeer in Fig. 1), and the update record (gU, gSavedData,
and gRetireTime in Fig. 1).

The replica set information consumes small space - typically a few bytes per 
replica - and it is stored only when the replica itself is present. 

An update record is stored on disk only while the update is in progress.
The space consumed by update records on a node is (S + aM/R)UD. Here,
(S + aM/R) shows the average space overhead of an update record on a node: S
is the size of gU and gRetireTime, a is the proportion of updates that shrink the
replica set size, M is the average object size, and R is the average replication
factor for objects; thus, aM/R shows the time-averaged space overhead of
gSavedData. U is the average number of objects updated per second and D is
the average update lifetime, including the deletion wait period (Section 2.7) and
delays introduced by retirement aggregation (Section 5.3). In Porcupine, S ≈ 60
bytes, a ≈ 1/50, M ≈ 5000 bytes, R ≈ 2, U ≈ 30, and D ≈ 120. Thus, the total
amount of stable storage used is about four hundred kilobytes.
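
The figure of about four hundred kilobytes follows directly from the formula. The short calculation below is a sketch that plugs the quoted Porcupine values into (S + aM/R)UD; the function name is hypothetical, and D is assumed to be expressed in seconds so that it matches U.

```python
def update_record_space(S, a, M, R, U, D):
    """Bytes of stable storage used by update records: (S + a*M/R) * U * D."""
    per_record = S + a * M / R      # average bytes per update record
    return per_record * U * D       # records alive at any time = U * D


# Values quoted for Porcupine: 110 bytes per record, 3600 live records.
porcupine_bytes = update_record_space(S=60, a=1 / 50, M=5000, R=2, U=30, D=120)
```

Here aM/R = 50 bytes, so each record costs about 110 bytes, and 30 × 120 = 3600 records are in progress at a time, giving roughly 396,000 bytes.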

7 Conclusions 

We have described a new decentralized replication algorithm designed for Inter- 
net servers in this paper. Following are key features of our algorithm. 

— Eventual consistency under most failure types, e.g., node and link failures
and sudden node retirements.

— Any replica can issue updates any time. 

— Support for dynamic replica addition and deletion. 

— Minimal space overhead, especially, efficient object deletion. 

— Minimal computational and networking overhead in the common case. 

As future work, we plan to investigate the space, time, and computation com- 
plexities of the algorithm under update conflicts. In addition, we are studying the 
implementation of semantically richer operations, e.g., multi-object transactions, 
on top of our algorithm. 

Abstract. In this paper, we explore data replication protocols that provide
both fault tolerance and good performance without compromising
consistency. We do this by combining transactional concurrency control
with group communication primitives. In our approach, transactions are
executed at only one site so that not all nodes incur the overhead
of producing results. To further reduce latency, we use an optimistic 
multicast technique that overlaps transaction execution with total order 
message delivery. The protocols we present in the paper provide correct 
executions while minimizing overhead and providing higher scalability. 



1 Introduction 

Conventional algorithms for database replication emphasize consistency and 
fault tolerance instead of performance [1]. As a result, database designers ignore
these algorithms and use lazy replication instead, thereby compromising both
fault tolerance and consistency [2]. A way out of this dilemma [7, 6] is to combine
database replication techniques with group communication primitives [4].
This approach has produced efficient eager replication protocols that guarantee
consistency and increase fault tolerance. However, in spite of some suggested
optimizations [9, 10], this new type of protocol still has two major drawbacks.
One is the amount of redundant work performed at all sites. The other is the high 
abort rates created when consistency is enforced. In this paper, we address these 
two issues. First, we present a protocol that minimizes the amount of redundant 
work. Transactions, even those over replicated data, are executed at only one 
site. The other sites only install the final changes. With this, and unlike in ex- 
isting replication protocols, the aggregated computing power actually increases 
as more nodes are added. This is a significant advantage in environments with 
expensive transaction processing (e.g., dynamic web pages). A negative aspect of 
this protocol is that it might abort transactions in order to guarantee serializabil- 
ity. To reduce the rate of aborted transactions while still providing consistency, 
we propose a second protocol based on a transaction reordering technique. 

The paper is organized as follows. Section 2 introduces the system model. 
Sections 3 and 4 describe the algorithms. Section 5 discusses fault tolerance 
aspects. Section 6 contains the correctness proofs. Section 7 concludes the paper. 

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 315-329, 2000. 

(c) Springer-Verlag Berlin Heidelberg 2000

2 System Model

In a replicated database, a group of nodes N = {N_1, N_2, ..., N_n}, each containing
the entire database, communicate by exchanging messages. Sites only fail by
crashing (no byzantine failures) and there is always at least one node available.



2.1 Communication Model 

The system uses various group communication primitives [4]. Regarding message 
ordering, we use a multicast primitive not providing any order, a primitive pro- 
viding FIFO order (messages of one sender are delivered in FIFO order) and one 
providing a total order (all messages are delivered at all sites in the same order). 
In regard to fault-tolerance, we use both a reliable delivery service (whenever 
a message is delivered at an available site it will be delivered at all available 
sites) and a uniform reliable delivery service (whenever a message is delivered at 
any faulty or available site it will be delivered at all available sites). We assume 
a virtual synchronous system, where all group members perceive membership 
(view) changes at the same virtual time, i.e., two sites deliver exactly the same 
messages before installing a new view. 

We use an aggressive version [9] of the optimistic total order broadcast pre- 
sented in [10]. Each message corresponds to a transaction. Messages are opti- 
mistically delivered as soon as they are received and before the definitive ordering 
is established. With this, the execution of a transaction can overlap with the cal- 
culation of the total order. If the initial order is the same as the definitive order, 
the transactions can simply be committed. If the final order is different, additional
actions have to be taken to guarantee consistency. This optimistic broadcast
is defined by three primitives [9]. TO-broadcast(m) broadcasts the message
m to all the sites in the system. Opt-deliver(m) delivers message m optimistically
to the application (with no order guarantees). TO-deliver(m) delivers m
definitively to the application (in a total order). This means that messages can be
opt-delivered in a different order at each site, but are to-delivered in the same
total order at all sites. A sequence of opt-delivered messages is a tentative order.
A sequence of to-delivered messages is the definitive order or total order.
Furthermore, this optimistic multicast primitive ensures that every to-broadcast 
message is eventually opt-delivered and to-delivered by every site in the system. 
It also ensures that no site to-delivers a message before opt-delivering it. 
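
To make the two delivery orders concrete, the sketch below records what a single site observes and reports the pairs of messages whose tentative and definitive orders disagree. It is not the broadcast protocol itself (no network or ordering algorithm is modelled), and the class and method names are hypothetical.

```python
class OptimisticOrderLog:
    """Per-site record of opt-deliveries and to-deliveries."""

    def __init__(self):
        self.tentative = []    # order in which messages were opt-delivered
        self.definitive = []   # order in which messages were to-delivered

    def opt_deliver(self, m):
        self.tentative.append(m)

    def to_deliver(self, m):
        # The primitive guarantees opt-delivery happens first.
        if m not in self.tentative:
            raise ValueError(f"{m} to-delivered before being opt-delivered")
        self.definitive.append(m)

    def mismatches(self):
        """Pairs (a, b) with a before b definitively but after b tentatively."""
        pos = {m: i for i, m in enumerate(self.tentative)}
        return [(a, b)
                for i, a in enumerate(self.definitive)
                for b in self.definitive[i + 1:]
                if pos[a] > pos[b]]
```

Only mismatched pairs that also conflict require the additional actions mentioned above; non-conflicting pairs can be ignored.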

2.2 Transaction Model 

Clients interact with the database by issuing transactions, i.e., partially ordered 
sets of read and write operations. Two transactions conflict if they access the
same data item and at least one of them is a write. A history H of committed
transactions is serial if it totally orders all transactions. Two histories H1 and
H2 are conflict equivalent if they are over the same set of transactions and order
conflicting operations in the same way. A history H is serializable if it is conflict
equivalent to some serial history [1]. For replicated databases, the correctness
criterion is 1-copy-serializability [1]. Using this criterion, each copy must appear
as a single logical copy and the execution of concurrent transactions must be
equivalent to a serial execution over all the physical copies.

In this paper, concurrency control is based on conflict classes [9]. Each con- 
flict class represents a partition of the data. Transactions accessing the same 
conflict class have a high probability of conflicts, as they can access the same 
data, while transactions in different partitions do not conflict and can be executed
concurrently. In [9] each transaction must access a single basic conflict
class (e.g., C_x). We generalize this model and allow transactions to access compound
conflict classes. A compound conflict class is a non-empty set of basic
conflict classes (e.g., {C_x, C_y}). We assume that the (compound) conflict class
of a transaction is known in advance. Each site has a queue CQ_x associated to
each basic conflict class C_x. When a transaction is delivered to a site, it is added
to the queues of the basic conflict classes it accesses. This concurrency control
mechanism is a simplified version of a lock table [3].

Each conflict class has a master site. We use a read-one/write-all-available
approach. Queries (read only transactions) can be executed at any site using 
a snapshot of the data (i.e., they do not interfere with update transactions). 
Update transactions are broadcast to all sites, however they are only executed 
at the master site of their conflict class. We say a transaction is local to the 
master site of its conflict class and is remote everywhere else. 

3 Increasing Scalability 

3.1 The Problem and a Solution 

The scalability of data replication protocols heavily depends on the update ratio. 
To see why, consider a centralized system capable of processing t transactions 
per second. Now assume a system with n nodes, all of them identical to the 
centralized one. Assume that the fraction of updates is w. Assume the load 
of local transactions at a node is x transactions per second. Since nodes must 
also process the updates that come from other nodes, the following must hold:
x + w (n − 1) x = t, that is, a node processes x local transactions per second,
plus the percentage of updates arriving at other nodes (w · x) times the number
of nodes. From here, the number of transactions that can be processed at each
node is x = t (1 + w (n − 1))^{-1}. The total capacity of the system is n times
that expression, which yields, with t normalized to 1, n (1 + w (n − 1))^{-1}. This
expression has a maximum of n when w = 0 (there are no updates) and a
minimum of 1 when w = 1 (all operations are updates).

Thus, as the update factor w approaches 1, the total capacity of the system 
tends to that of a single node, independently of how many nodes are in the 
system. Note that the drop in system capacity is very sharp. For 50 nodes, 
w = 0.2 (20% updates) results in a system with a tenth of the nominal capacity. 
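
The capacity expression is easy to evaluate numerically. The helper below is a small sketch (the function name is hypothetical) that computes n · t / (1 + w (n − 1)) so the sharp drop can be checked for specific values.

```python
def system_capacity(n, w, t=1.0):
    """Total transactions per second of n nodes when a fraction w of the
    load is updates that every node must also process."""
    return n * t / (1 + w * (n - 1))


# 50 nodes with 20% updates: 50 / 10.8, i.e. under a tenth of the nominal 50.
capacity_50_nodes = system_capacity(50, 0.2)
```

For 50 nodes and w = 0.2 the denominator is 1 + 0.2 × 49 = 10.8, so the system delivers about 4.6 times the capacity of one node instead of 50 times.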

This limitation can be avoided if transactions execute only at one site (the
local site) and the other sites only install the corresponding updates. This requires
significantly less work than actually running the transactions, as has been shown
in [8]. In order to guarantee consistency, the total order established by the
to-delivery primitive is used as a guideline to serialize transactions. All sites see
the same total order for update transactions. Thus, to guarantee correctness, it
suffices for a site to ensure that conflicting transactions are ordered according to
the definitive order. Transactions can be executed in different orders at different
sites if they are not serialized with respect to each other.

When an update transaction T is submitted, it is multicast to all nodes. This 
message contains the entire transaction and it is first opt-delivered at all sites,
which can then proceed to add the corresponding entries in the local queues.
Only the local site executes T: whenever T is at the head of any of its queues
the corresponding operation is executed on a shadow copy of the data. With this,
aborting a transaction simply requires discarding the shadow copies. When the
transaction commits, the shadow copies become the valid versions of the data.

When a transaction is to-delivered at a site, the site checks whether the 
definitive and tentative orders agree. If they agree, the transaction can be committed
after its execution has completed. If they do not agree, there are several
cases to consider. The first one is when the lack of agreement is with non-conflicting
transactions. In that case, the ordering mismatch can be ignored. If
the mismatch is with conflicting transactions, there are two possible scenarios.
If no local transactions are involved, the transaction can simply be rescheduled
in the queues before the transactions that are only opt-delivered but not
yet to-delivered. With this, to-delivered transactions will then follow the definitive
order. If local transactions are involved, the procedure is similar but local
transactions (that have been executed in the wrong order) must be aborted and
rescheduled again (by putting them back in the queues in the proper order).

Once a transaction is to-delivered and completely executed the local site 
broadcasts the commit message containing all updates (also called write set
WS). Upon receiving a commit message (which does not need any ordering
guarantee), a remote site installs the updates for a certain basic conflict class as 
soon as the transaction reaches the head of the corresponding queue. When all 
updates are installed the transaction commits. 

3.2 Example 

Assume there are two basic conflict classes C_x, C_y and two sites N and N′. N
is the master of conflict classes {C_x} and {C_x, C_y}, and N′ is the master of {C_y}.
We denote the conflict class of a transaction T_i by C_{T_i}. Assume there are three
transactions, C_{T1} = {C_x, C_y}, C_{T2} = {C_y} and C_{T3} = {C_x}. That is, T1 and T3
are local at N and T2 is local at N′. The tentative order at N is: T1, T2, T3 and
at N′ is: T2, T3, T1. The definitive order at both sites is: T1, T2, T3. When all the
transactions have been opt-delivered, the queues at each site are as follows:

    At N:                 At N′:
    CQ_x = T1, T3         CQ_x = T3, T1
    CQ_y = T1, T2         CQ_y = T2, T1
At site N, T1 can start executing both its operations on C_x and C_y since it
is at the head of the corresponding queues. When T1 is to-delivered the orders
are compared. In this case, the definitive order is the same as the tentative order
and hence, T1 can commit. When T1 has finished its execution, N will send a
commit message with all the corresponding updates. N can then commit T1 and
remove it from the queues. The same will be done for T3 even if, in principle,
T2 goes first in the final total order. However, since these two transactions do
not conflict, this mismatch can be ignored. Parallel to this, when N receives
the commit message for T2 from N′, the corresponding changes can be installed
since T2 is at the head of the queue CQ_y. Once the changes are installed, T2 is
committed and removed from CQ_y.

At site N′, T2 can start executing since it is local and at the head of its queue.
However, when T1 is to-delivered, N′ realizes that it has executed T2 out of order
and will abort T2, moving it back in the queue. T1 is moved to the head of both
queues. Since T3 is remote at N′, moving T1 to the head of the queue CQ_x does
not require aborting T3. T1 is now the first transaction in all the queues, but it
is a remote transaction. Therefore, no transaction is executing at N′. When the
commit message of T1 arrives at N′, T1's updates are applied, T1 commits and
is removed from both queues. Then, T2 will start executing again. When T2 is
to-delivered and completely executed, a commit message with its updates will
be sent, and T2 will be removed from CQ_y.



3.3 The NODO Algorithm 



The first algorithm we propose, Nodo (NOn-Disjoint conflict classes and Optimistic
multicast), follows that in [9]. The algorithm is described according to
the different phases in a transaction’s execution: a transaction is opt-delivered, 
to-delivered, completes execution, and commits. We assume access to the queues 
is regulated by locks and latches [3]. There are some restrictions on when certain 
events may happen. For instance, a transaction can only commit when it has 
been executed and to-delivered. Waiting for the to-delivery is necessary to avoid 
conflicting serialization orders at the different sites. Each transaction has two
state variables to ensure this behavior: The execution state of a transaction can 
be active (as soon as it is queued) or executed (when its execution has finished). 
A transaction can only become executed at its master site. The delivery state 
can be pending (it has not been to-delivered yet) or committable (it has been 
to-delivered). When a transaction is opt-delivered its state is set to active and 
pending. In the following we assume that whenever a transaction is local and 
the first one in any of its queues, the corresponding operations are submitted for 
execution. 

We assume that each of the phases is done in an atomic step. For instance, 
adding a transaction to the different queues during opt-delivery or rescheduling 
transactions during to-delivery is not interleaved with any other action. Note 
that aborting a transaction simply involves discarding the shadow copy, the 
transaction itself is kept in the queues but in different positions. 

Upon Opt-delivery of T_i
    Mark T_i as active and pending
    For each conflict class C_x ∈ C_{T_i}
        Append T_i to the queue CQ_x
    EndFor

Upon TO-delivery of T_i
    Mark T_i as committable
    If T_i is executed then
        Broadcast commit(WS_{T_i})
    Else (still active or not local)
        For each C_x ∈ C_{T_i}
            If First(CQ_x) = T_j ∧ Local(T_j) ∧ Pending(T_j) then
                Abort T_j
                Mark T_j as active
            EndIf
            Schedule T_i before the first pending transaction in CQ_x
        EndFor
    EndIf

Upon complete execution of T_i
    If T_i is marked as committable then
        Broadcast commit(WS_{T_i})
    Else
        Mark T_i as executed
    EndIf

Upon receiving commit(WS_{T_i})
    If ¬Local(T_i) then
        Delay until T_i becomes committable
        For each C_x ∈ C_{T_i}
            When T_i becomes the first in CQ_x
                Apply the updates of WS_{T_i} corresponding to C_x
                Remove T_i from CQ_x
        EndFor
    Else
        Remove T_i from all CQ_x
    EndIf
    Commit T_i
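
The queue handling during opt-delivery and TO-delivery can be sketched as a single-site simulation. This is an illustration under assumptions, not the authors' implementation: execution, shadow copies, and commit messages are not modelled, and the names `Txn`, `Site`, and `replay_example_at_remote_site` are hypothetical. The replay function reproduces the state of the second site of the example in Section 3.2 after T1 is TO-delivered.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # identity-based equality: each Txn is a distinct object
class Txn:
    name: str
    classes: frozenset          # compound conflict class, e.g. {"x", "y"}
    master: str                 # site at which the transaction is local
    execution: str = "active"   # "active" or "executed"
    delivery: str = "pending"   # "pending" or "committable"


class Site:
    def __init__(self, name, basic_classes):
        self.name = name
        self.queues = {c: [] for c in basic_classes}  # one queue per basic class
        self.aborted = []                             # names of aborted local txns

    def opt_deliver(self, t):
        t.execution, t.delivery = "active", "pending"
        for c in sorted(t.classes):
            self.queues[c].append(t)

    def to_deliver(self, t):
        """Return True if the caller should broadcast commit(WS_t) now."""
        t.delivery = "committable"
        if t.execution == "executed":
            return True
        for c in sorted(t.classes):
            q = self.queues[c]
            head = q[0]
            # A local, still-pending head ran out of order: abort it.
            if head.master == self.name and head.delivery == "pending":
                if head.name not in self.aborted:     # record once in this sketch
                    self.aborted.append(head.name)
                head.execution = "active"
            # Reschedule t before the first pending transaction.
            q.remove(t)
            pos = next((i for i, u in enumerate(q) if u.delivery == "pending"),
                       len(q))
            q.insert(pos, t)
        return False


def replay_example_at_remote_site():
    site = Site("N'", ["x", "y"])
    t1 = Txn("T1", frozenset({"x", "y"}), master="N")
    t2 = Txn("T2", frozenset({"y"}), master="N'")
    t3 = Txn("T3", frozenset({"x"}), master="N")
    for t in (t2, t3, t1):          # tentative order at this site
        site.opt_deliver(t)
    site.to_deliver(t1)             # T1 is first in the definitive order
    order = {c: [t.name for t in q] for c, q in site.queues.items()}
    return order, site.aborted
```

After the replay, T1 heads both queues, T2 has been aborted and requeued behind it, and the remote transaction T3 is merely rescheduled, which matches the narrative of the example.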



4 Reducing Transaction Aborts 

In the Nodo algorithm, a mismatch between the local optimistic order and the 
total order may result in a transaction being aborted. The resulting abort rate is 
not necessarily very high since for this to happen, the transactions must conflict, 
appear in the system at about the same time, and the site where the mismatch 
occurs must be the local site where the aborted transaction was executing. In 
all other cases there are no transaction aborts, only reschedulings. Nevertheless, 
network congestion and high loads can lead to messages not being spontaneously 
ordered and, thus, to higher abort rates. The number of aborted transactions 
can be reduced by taking advantage of the fact that Nodo is a form of master 
copy algorithm (remote sites only install updates in the proper order). Thus, a 
local site can unilaterally decide to change the serialization order of two local 
transactions (i.e., follow the tentative order instead of the definitive total order),
thereby avoiding the abort. To guarantee correctness, the local site must inform 
the rest of the sites about the new execution order (by appending this informa- 
tion to the commit message). Special care must be taken with transactions that 
belong to a non basic conflict class (e.g., Ct,; = C^}). A site can only follow 

the tentative order Ti T2 instead of the deflnitive order T2 Ti, if 

Tis conflict class Cti is a subset of T2^s conflict class Ct2 and botho are local 
transactions. Otherwise, inconsistencies could occur. We call this new algorithm 
Reordering as the serialization order imposed by the deflnitive order might be 
changed for the tentative one. 
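
The reordering condition stated above (subset of conflict classes, both transactions local) can be written as a small predicate. The sketch below is only illustrative: the `Txn` record, its field names, and the function name are assumptions and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    classes: frozenset   # basic conflict classes accessed, e.g. {"Cx", "Cy"}
    master: str          # site at which the transaction is local


def can_follow_tentative_order(first: Txn, second: Txn, site: str) -> bool:
    """True if `site` may serialize `first` before `second` although the
    definitive total order places `second` before `first`."""
    both_local = first.master == site and second.master == site
    return both_local and first.classes <= second.classes


t_small = Txn("Ta", frozenset({"Cx"}), master="N")
t_large = Txn("Tb", frozenset({"Cx", "Cy"}), master="N")
t_remote = Txn("Tc", frozenset({"Cx"}), master="N2")
```

With these hypothetical transactions, `t_small` may be kept ahead of `t_large` at site `N`, but not the other way round, and a transaction mastered elsewhere can never be reordered at `N`.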







4.1 Example

Assume a database with two basic conflict classes $C_x$ and $C_y$. Site $N$ is the
master of the conflict classes $\{C_x\}$ and $\{C_x, C_y\}$. $N'$ is the master of conflict class
$\{C_y\}$. To show how reordering takes place, assume there are three transactions with
$C_{T_1} = C_{T_3} = \{C_x, C_y\}$ and $C_{T_2} = \{C_x\}$. All three transactions are local to $N$.

The tentative order at both sites is $T_2, T_3, T_1$. The definitive order is $T_1, T_2, T_3$.
After opt-delivering all transactions they are ordered as follows at both sites:

$CQ_x$: $T_2, T_3, T_1$
$CQ_y$: $T_3, T_1$

At site $N$, $T_2$ and $T_3$ can start execution (they are local and are at the
head of one of their queues). Assume that $T_1$ is to-delivered at this stage. In
the Nodo algorithm, $T_1$ would be put at the head of both queues, which can
only be done by aborting $T_2$ and $T_3$. This abort is, however, unnecessary since
$N$ controls the execution of these transactions and the other sites are simply
waiting to be told what to do. Thus, $N$ can simply decide not to follow the total
order but to serialize according to the tentative order. This is possible because all
transactions involved are local and the conflict classes of $T_2$ and $T_3$ are subsets
of $T_1$'s conflict class. When such a reordering occurs, $T_1$ becomes the serializer
transaction of $T_2$ and $T_3$. $T_2$ no longer needs to wait to be to-delivered in order to
commit. Being at the head of the queue and with its serializer transaction
to-delivered, the commit message for $T_2$ can be sent once $T_2$ is completely executed
(thereby reducing the latency for $T_2$). The commit message of $T_2$ also contains
the identifier of the serializer transaction $T_1$. The same applies to $T_3$.
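
The queue contents of this example can be replayed with a few lines of Python. This is a sketch under assumptions: the helper `build_queues` is hypothetical, and the conflict-class assignment ($T_1$ and $T_3$ access both classes, $T_2$ only $C_x$) is the one implied by the queues and the subset remark in the example.

```python
def build_queues(opt_order, classes):
    """Append each transaction, in opt-delivery order, to the queue of
    every basic conflict class it accesses."""
    queues = {}
    for txn in opt_order:
        for conflict_class in sorted(classes[txn]):
            queues.setdefault(conflict_class, []).append(txn)
    return queues


# Assumed class assignment and the tentative order of the example.
classes = {"T1": {"Cx", "Cy"}, "T2": {"Cx"}, "T3": {"Cx", "Cy"}}
tentative_order = ["T2", "T3", "T1"]

queues = build_queues(tentative_order, classes)
# Transactions at the head of at least one of their queues.
startable = sorted({queue[0] for queue in queues.values()})
```

The heads of the two queues are exactly the transactions that the master may start executing before the definitive order is known.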

Site $N'$ has at the beginning no information about the reordering. Thus,
not knowing better, when $T_1$ is to-delivered at $N'$, $N'$ will reschedule $T_1$ before
$T_2$ and $T_3$ as described in the Nodo algorithm. However, when $N'$ receives
the commit message of $T_2$, it realizes that a reordering took place (since the
commit message contains the information that $T_2$ has been serialized before $T_1$).
$N'$ will then reorder $T_2$ ahead of $T_1$ and mark it committable. $N'$, however,
only reschedules $T_2$ when $T_1$ has been to-delivered, in order to ensure 1-copy
serializability. The rescheduling of $T_3$ will take place when the commit message
for $T_3$ arrives, which will also contain $T_1$ as the serializer transaction. To
prevent $T_2$ and $T_3$ from being executed in the wrong order at $N'$, commit messages
are sent in FIFO order (note that FIFO is not needed in the Nodo algorithm).

As this example suggests, there are restrictions on when reordering can take
place. To see this, consider three transactions with $C_{T_1} = \{C_x\}$, $C_{T_2} = \{C_y\}$, and
$C_{T_3} = \{C_x, C_y\}$. $T_1$ and $T_3$ are local to $N$, $T_2$ is local to $N'$. Now assume that
the tentative order at $N$ is $T_3, T_1, T_2$ and at $N'$ it is $T_1, T_2, T_3$. The definitive
total order is $T_1, T_2, T_3$. After all three transactions have been opt-delivered the
queues at both sites look as follows:

Queues at site $N$:   $CQ_x$: $T_3, T_1$    $CQ_y$: $T_3, T_2$
Queues at site $N'$:  $CQ_x$: $T_1, T_3$    $CQ_y$: $T_2, T_3$

Since $T_3$ is local and it is at the head of its queues, $N$ starts executing $T_3$. For
the same reasons, $N'$ starts executing $T_2$. When $T_1$ is to-delivered at $N$, $T_3$ cannot
be reordered before $T_1$. Assume this were done. $T_3$ would commit and the
commit message would be sent to $N'$. Now assume the following scenario at $N'$.
Before $N'$ receives the commit message for $T_3$, both $T_1$ and $T_2$ are to-delivered.
Since $T_2$ is local, it can commit when it is executed (and the commit is sent to
$N$). Hence, by the time the commit message for $T_3$ arrives, $N'$ will produce the
serialization order $T_2 \rightarrow T_3$. At $N$, however, when it receives $T_2$'s commit, it
has already committed $T_3$. Thus, $N$ has the serialization order $T_3 \rightarrow T_2$, which
contradicts the serialization order at $N'$.

This situation arises because $C_{T_3} = \{C_x, C_y\}$ is not a subset of $C_{T_1} = \{C_x\}$
and, therefore, $T_1$ cannot be a serializer transaction for $T_3$. In order to clarify
why subsets (i.e., the conflict class of the reordered transaction is a subset of
the conflict class of the serializer transaction) are needed for reordering, assume
that $T_1$ also accesses $C_y$ (with this, $C_{T_3} \subseteq C_{T_1}$). In this case, the queues are:

Queues at site $N$:   $CQ_x$: $T_3, T_1$    $CQ_y$: $T_3, T_1, T_2$
Queues at site $N'$:  $CQ_x$: $T_1, T_3$    $CQ_y$: $T_1, T_2, T_3$

The subset property guarantees that $T_1$ conflicts with any transaction with
which $T_3$ conflicts. Hence, $T_1$ and $T_2$ conflict and $N'$ will delay the execution
and commitment of $T_2$ until the commit message of $T_1$ is delivered. As the
commit message of the reordered transaction $T_3$ will arrive before the one of
$T_1$, $T_3$ will be committed before $T_1$ and thus before $T_2$, solving the previous
problem. This means that both $N$ and $N'$ will produce the same serialization
order $T_3 \rightarrow T_1 \rightarrow T_2$.

4.2 REORDERING Algorithm 

In general, the Reordering algorithm is similar to Nodo except in a few points
(in the following we omit the actions upon opt-delivery since they are the same
as in the Nodo algorithm). The commit message must now contain the identifier
of the serializer transaction (denoted as Ser in the algorithm description) and
follow a FIFO order. As in Nodo, when a transaction $T_i$ is to-delivered, the
transaction is marked as committable. At $T_i$'s local site, any non-to-delivered
local transaction $T_j$ whose conflict class $C_{T_j}$ is a subset of $C_{T_i}$ and that precedes
$T_i$ in the queues (reorder set, RS) is marked as committable (since the
commit order is now no longer the definitive but the tentative order). Thus, it is
possible that when a reordered transaction is to-delivered, it is
already marked as committable or has even been committed. In this case the
to-delivery message is ignored. Local non-to-delivered conflicting transactions
that cannot be reordered and have started execution are aborted (abort set,
AS). When the to-delivered transaction is remote, the algorithm behaves as the
Nodo algorithm. Note that a remote reordered transaction $T_i$ cannot commit
at a site until its serializer transaction is to-delivered at that site. When this
happens, $T_i$ is rescheduled before its serializer transaction. The rescheduling
together with the FIFO ordering ensures that remote transactions will commit
at all sites in the same order in which they did at the local site.
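
The abort set and the reorder set described above can be computed as in the following sketch. It rests on assumptions: the function name, the argument layout, and the precise set conditions (in particular "not a subset" for AS and "subset or equal" for RS) are a reading of the prose, and "has started execution" is approximated by being at the head of a shared queue.

```python
def abort_and_reorder_sets(ti, queues, classes, opt_order, local, pending):
    """Return (AS, RS) for the to-delivery of a local, unfinished `ti`.

    `opt_order` lists the queued transactions in opt-delivery order.
    """
    ci = classes[ti]
    abort_set = []
    for tj in opt_order:
        cj = classes[tj]
        shared = ci & cj
        at_head = any(queues[x][0] == tj for x in shared)
        # Conflicting, not reorderable, possibly executing: must abort.
        if shared and not cj <= ci and at_head and tj in pending and tj in local:
            abort_set.append(tj)
    # Local pending transactions opt-delivered before ti with a sub-class.
    reorder_set = [tj for tj in opt_order[:opt_order.index(ti)]
                   if classes[tj] <= ci and tj in pending and tj in local]
    return abort_set, reorder_set


everything = {"T1", "T2", "T3"}

# First example of Section 4.1 at site N (all transactions local).
first = abort_and_reorder_sets(
    "T1",
    queues={"Cx": ["T2", "T3", "T1"], "Cy": ["T3", "T1"]},
    classes={"T1": {"Cx", "Cy"}, "T2": {"Cx"}, "T3": {"Cx", "Cy"}},
    opt_order=["T2", "T3", "T1"], local=everything, pending=everything)

# Second example at site N (T2 is remote, T3 is not a subset of T1).
second = abort_and_reorder_sets(
    "T1",
    queues={"Cx": ["T3", "T1"], "Cy": ["T3", "T2"]},
    classes={"T1": {"Cx"}, "T2": {"Cy"}, "T3": {"Cx", "Cy"}},
    opt_order=["T3", "T1", "T2"], local={"T1", "T3"}, pending=everything)
```

In the first configuration nothing is aborted and both earlier transactions are reordered; in the second, the transaction with the larger conflict class cannot be reordered and has to be aborted.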
Upon to-delivery of transaction T_i
    If ¬Committed(T_i) ∧ Pending(T_i) then
        (T_i has not been reordered)
        If Local(T_i) then
            If T_i is marked executed then
                Ser(T_i) = T_i   (T_i is its own serializer)
                Broadcast commit(WS_{T_i}, Ser(T_i))
            Else (T_i has not finished yet)
                AS = {T_j | C_{T_j} ∩ C_{T_i} ≠ ∅ ∧ C_{T_j} ⊈ C_{T_i}
                        ∧ ∃x ∈ C_{T_j} ∩ C_{T_i} : T_j = First(CQ_x)
                        ∧ Pending(T_j) ∧ Local(T_j)}
                For each T_j ∈ AS
                    (abort conflicting transactions that cannot be reordered)
                    Abort T_j and mark it as active
                EndFor
                (try to reorder transactions)
                RS = {T_j | C_{T_j} ⊆ C_{T_i} ∧ T_j →opt T_i
                        ∧ Pending(T_j) ∧ Local(T_j)}
                For each T_j ∈ RS ∪ {T_i} in opt-delivery order
                    Mark T_j as committable
                    Ser(T_j) = T_i   (T_i is serializer of T_j)
                    Schedule T_j before the first pending
                        transaction in all CQ_x with C_x ∈ C_{T_j}
                EndFor
            EndIf
        Else (it is a remote transaction)
            Mark T_i committable
            For each conflict class C_x ∈ C_{T_i}
                If T_j = First(CQ_x) ∧ Pending(T_j) ∧ Local(T_j) then
                    Abort T_j and mark it as active
                EndIf
                Schedule T_i before the first transaction
                    marked as pending in queue CQ_x
            EndFor
        EndIf
    Else (transaction has been reordered)
        Ignore the message
    EndIf

Upon complete execution of T_i
    If T_i is marked as committable then
        Broadcast commit(WS_{T_i}, Ser(T_i))
    Else
        Mark T_i as executed
    EndIf

Upon receiving commit(WS_{T_i}, Ser(T_i))
    If ¬Local(T_i) then
        Delay until Ser(T_i) is committable
        If T_i ≠ Ser(T_i) then
            Mark T_i as committable
        EndIf
    EndIf
    For each C_x ∈ C_{T_i}
        If not Local(T_i) then
            If T_i ≠ Ser(T_i) then
                Reschedule T_i just before Ser(T_i) in CQ_x
            EndIf
            When T_i becomes the first in CQ_x
                apply the updates of WS_{T_i} corresponding to C_x
        EndIf
        Remove T_i from CQ_x
    EndFor
    Commit T_i



5 Dealing with Failures


In our system, each site acts as a primary for the conflict classes it owns and as a
backup for all other conflict classes. In the event of site failures, the available sites
simply have to select a new master for the conflict classes of the failed node. The
new master will also take over the responsibility for all pending transactions
for which the failed node was the owner (i.e., where the commit message has
not been received by the available sites). Such a master replacement algorithm
guarantees the availability of transactions in the presence of failures. That is, a
transaction will commit as long as there is at least one available site.

For both algorithms, transaction messages must be uniformly multicast, because
only then is it guaranteed that the master executes and commits
a transaction only if all sites will receive it and thus be able to take over if the
master crashes (reliable multicast does not provide this, since the master can
commit a transaction which the other sites have not yet received).

In the Nodo algorithm, commit messages do not need to be uniform. Local 
transactions can even be committed before multicasting the commit message. 
The worst that can happen is that a master commits a transaction and fails 
before the commit message reaches the other sites. When a new master takes 
over, it will reexecute the transaction and send a new commit message. As the 
total order is always followed inconsistencies cannot arise. 

In the Reordering algorithm, commit messages must be uniform and the 
master may not commit the transaction before the commit message is delivered. 
If the commit message were not uniform, a master could reorder a transaction, 
send the commit message and then crash. If the rest of the replicas do not see 
the commit message, they would use a different serialization order (as the failed 
node’s optimistic order is unknown to the other sites). 

6 Correctness 

In this section we prove the correctness (i.e., 1-copy-serializability), liveness, 
and consistency of the protocols. The proofs assume histories encompassing sev- 
eral group views. Important for both protocols is the fact that transactions are 
enqueued (respectively rescheduled) in one atomic step. Hence, there is no in- 
terleaving between transactions and all sites produce automatically serializable 
histories. As a result, in order to prove 1-copy-serializability, it suffices to show 
that all histories are conflict equivalent. Since conflict equivalence requires his- 
tories to have the same set of transactions, we refer in the corresponding proofs 
only to the available sites. 

6.1 Correctness of NODO 

We will show that all sites order conflicting transactions according to the
definitive total order.

Definition 1 (Direct conflict). Two transactions $T_1$ and $T_2$ are in direct
conflict if they are serialized with respect to each other, $T_1 \rightarrow T_2$, and there are no
transactions serialized between them: $\nexists\, T_3 \mid T_1 \rightarrow T_3 \rightarrow T_2$.

Lemma 1 (Total order and Serializability in NODO). Let $H_N$ be the
history produced at site $N$, and let $T_1$ and $T_2$ be two directly conflicting transactions
in $H_N$. If $T_1 \rightarrow_{to} T_2$ then $T_1 \rightarrow_{H_N} T_2$.

Proof (lemma 1): Assume the lemma does not hold, i.e., there is a pair of
transactions $T_1$, $T_2$ such that $T_1 \rightarrow_{H_N} T_2$ but $T_2 \rightarrow_{to} T_1$. The fact that $T_2$
precedes $T_1$ in the total order means that $T_2$ was to-delivered before $T_1$. Since $T_1$
and $T_2$ are in direct conflict, there was at least one queue where both transactions
had entries. If $T_1 \rightarrow_{H_N} T_2$, then the entry for $T_1$ must have been ahead in the
queue. However, upon to-delivery of $T_2$, if $T_1$ was the first transaction, Nodo
would have aborted $T_1$ and rescheduled it after $T_2$. If $T_1$ was not the first in the
queue, Nodo would have put $T_2$ ahead of $T_1$ in the queue. In both cases this
would result in $T_2 \rightarrow_{H_N} T_1$, which contradicts the initial assumption. □

Lemma 2 (Conflict equivalence in NODO). For any two sites $N$ and $N'$,
$H_N$ is conflict equivalent to $H_{N'}$.

Proof (lemma 2): From Lemma 1, all pairs of directly conflicting transactions
in both $H_N$ and $H_{N'}$ are ordered according to the total order. Thus, $H_N$ and
$H_{N'}$ are conflict equivalent since they are over the same set of transactions and
order conflicting transactions in the same way. □
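
The notion of conflict equivalence used in Lemma 2 can be checked mechanically on small examples. The sketch below is a simplification under stated assumptions: each site's history is reduced to its serial commit order, two transactions conflict when their conflict classes intersect, and the function name and sample data are hypothetical.

```python
from itertools import combinations


def conflict_equivalent(history_a, history_b, classes):
    """True if both commit orders contain the same transactions and order
    every pair of conflicting transactions in the same way."""
    if set(history_a) != set(history_b):
        return False
    pos_a = {t: k for k, t in enumerate(history_a)}
    pos_b = {t: k for k, t in enumerate(history_b)}
    for t1, t2 in combinations(history_a, 2):
        if classes[t1] & classes[t2]:            # shared conflict class
            if (pos_a[t1] < pos_a[t2]) != (pos_b[t1] < pos_b[t2]):
                return False
    return True


classes = {"T1": {"Cx"}, "T2": {"Cy"}, "T3": {"Cx", "Cy"}}
same = conflict_equivalent(["T1", "T2", "T3"], ["T2", "T1", "T3"], classes)
different = conflict_equivalent(["T3", "T1", "T2"], ["T1", "T2", "T3"], classes)
```

Swapping two non-conflicting transactions keeps the histories equivalent, whereas moving a transaction across one it conflicts with does not.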

Theorem 1 (1CPSR in NODO). The Nodo algorithm produces 1-copy-serializable
histories.

Proof (theorem 1): Since the histories of all available nodes are conflict
equivalent (Lemma 2) and serializable, the global history is 1-copy-serializable. □

6.2 Liveness of NODO 

Theorem 2 (Liveness in NODO). Each to-delivered transaction $T_i$ eventually
commits in the absence of catastrophic failures.

Proof (theorem 2): The theorem is proved by induction.

Induction Basis: Let $T_i$ be the first to-delivered transaction. Upon to-delivery,
each site places $T_i$ at the head of all its queues. Thus, $T_i$'s master can execute
and commit $T_i$ and then multicast the commit message. Remote sites will apply
the updates and also commit $T_i$.

Induction Hypothesis: The theorem holds for the to-delivered transactions with
positions $n \le k$, for some $k \ge 1$, in the definitive total order, i.e., all transactions
that have at most $k - 1$ preceding transactions will eventually commit.

Induction Step: Assume that transaction $T_i$ is at position $n = k + 1$ in the
definitive total order when it is to-delivered. Each node places $T_i$ in the corresponding
queues after any committable transaction (to-delivered before $T_i$) and before any
pending transaction (not yet to-delivered). All committable transactions that are
now ordered before $T_i$ have lower positions in the definitive total order. Hence,
they will all commit according to the induction hypothesis and be removed from
the queues. With this, $T_i$ will eventually be the first in each of its queues and,
as in the induction basis, eventually commit.

In all cases, if the master fails before the other sites have received the commit,
the new master will reexecute and resend the commit message. □

6.3 Consistency of NODO

Failed sites obviously do not receive the same transactions as available sites. Let
$\mathcal{T}$ be the subset of transactions to-delivered to a node before it failed.

Theorem 3 (Consistency of failed sites with NODO). All transactions,
$T_i \in \mathcal{T}$, that are committed at a failed node $N$ are committed at all available
nodes. Moreover, the committed projection of the history in $N$ is conflict
equivalent to the committed projection of the history of any of the available nodes when
this history is restricted to the transactions in $\mathcal{T}$.

Proof (theorem 3): A transaction $T_i$ can only commit at $N$ when it is
to-delivered. Since we use uniform reliable delivery, $T_i$ will also be to-delivered and
known at all available sites. If $T_i$ was not local at $N$, then $N$ must have received
a commit message from $T_i$'s master. If this master is available for sufficient time,
all other available sites will also receive the commit message. If the master fails,
a new master will take over, execute $T_i$, and resend the commit. This procedure
will repeat if the new master also fails before the rest of the system receives the
commit message. Since we assume there are some available nodes, eventually
one of these nodes will become the master and the transaction will commit. If
the transaction was local at $N$, the same argument applies. The equivalence of
histories follows directly from Lemma 2. □

6.4 Correctness of REORDERING 

In the Reordering algorithm it is not possible to use the total order as a
guideline since nodes can reorder local transactions. Thus, we start by proving
that transactions not involved in a reordering cannot get in between the serializer
and the transaction being reordered. Let $T_s$ be the serializer transaction of the
transactions in the set $\mathcal{T}_{T_s}$.

Lemma 3 (Reordered). A reordered transaction $T_i$ is always serialized before
its serializer transaction $T_s$, that is, if $T_i \in \mathcal{T}_{T_s}$, then $T_i \rightarrow T_s$.

Proof (lemma 3): It follows trivially from the algorithm. □

Lemma 4 (Serializer in REORDERING). For all transactions $T_i \in \mathcal{T}_{T_s}$
there is no transaction $T_j$, $T_j \notin \mathcal{T}_{T_s}$, such that $T_i \rightarrow T_j \rightarrow T_s$.

Proof (lemma 4): Assume that $N$ is the master site where the reordering takes
place. Since $T_s$ is the serializer of $T_i$, $T_i \rightarrow_{opt} T_s$ and $T_s \rightarrow_{to} T_i$.
Additionally, from Lemma 3, $T_i \rightarrow T_s$. There are two cases to consider: (a) $T_j \rightarrow_{to} T_s$
and (b) $T_s \rightarrow_{to} T_j$.

(a): Since $T_j$ is to-delivered before $T_s$, in the queues $T_j$ is before $T_i$, and $T_i$ is
before $T_s$. With $T_j$ ahead of them in their queues, $T_i$ and $T_s$ cannot be committed until
$T_j$ commits. Thus, $T_j$ cannot be serialized in between $T_i$ and $T_s$.

(b): Since $T_s$ is to-delivered before $T_j$ and $T_j \notin \mathcal{T}_{T_s}$, all sites will put $T_s$ ahead
of $T_j$ in the queues if this was not already the case ($T_j$ cannot have committed
because it has not yet been to-delivered). Since $C_{T_i} \subseteq C_{T_s}$, this effectively prevents
transactions from getting in between $T_i$ and $T_s$. Any transaction $T_j$ trying to do
so will conflict with $T_s$ and, since $T_s$ has been to-delivered before $T_j$, $T_j$ has to
wait until $T_s$ commits. By that time, $T_i$ will have committed at its master site
and its commit message will have been delivered and processed at all sites before
the one of $T_s$. Therefore, the final serialization order will be $T_i \rightarrow T_s \rightarrow T_j$. □

Lemma 5 (Conflict Equivalence in REORDERING). For any two sites
$N$ and $N'$, $H_N$ is conflict equivalent to $H_{N'}$.

Proof (lemma 5): We show that two directly conflicting transactions $T_1$ and $T_2$
with conflict classes $C_{T_1}$ and $C_{T_2}$ are ordered in the same way at $N$ and $N'$. We
have to distinguish several cases:

• $C_{T_1} \subseteq C_{T_2}$, $T_1$ and $T_2$ have the same master $N''$, and $T_2 \rightarrow_{to} T_1$:

(a) If $N''$ reorders $T_1$ and $T_2$ with respect to the total order, then, from
Lemma 4, no transaction $T_i \notin \mathcal{T}_{T_2}$ can be serialized in between. The commit for
$T_1$ will be sent before the commit for $T_2$ in FIFO order. Hence, all sites will then
execute $T_1$ before $T_2$.

(b) If $N''$ follows the total order to commit $T_1$ and $T_2$, then other sites cannot
change this order. The argument is similar to that in Lemma 1 and revolves around
the order in which transactions are committed at all sites.

• $C_{T_1} \subseteq C_{T_2}$, $T_1$ and $T_2$ have the same master $N''$, and $T_1 \rightarrow_{to} T_2$:

(c) If $C_{T_1} = C_{T_2}$ then cases (a) and (b) apply, exchanging $T_1$ and $T_2$.

(d) Otherwise $C_{T_1} \subset C_{T_2}$. In this case, $N''$ has no choice but to commit $T_1$
and $T_2$ in to-delivery order (the rules for reordering do not apply). From here,
and using the same type of reasoning as in Lemma 1, it follows that all sites
must commit $T_1$ before $T_2$.

• Either $C_{T_1} \subseteq C_{T_2}$ and $T_1$ and $T_2$ do not have the same master, or
$C_{T_1} \cap C_{T_2} \neq \emptyset$ and neither $C_{T_1} \subseteq C_{T_2}$ nor $C_{T_2} \subseteq C_{T_1}$:

(e) If $T_1$ or $T_2$ are involved in any type of reordering at their nodes, Lemma 4
guarantees that there will be no interleavings between the transactions involved
in the reordering and the other transaction. Thus, one transaction will be
committed before the other at all sites and, therefore, all sites will produce the same
serialization order.

(f) If $T_1$ and $T_2$ are not involved in any reordering, then, similar to Lemma
1, both of them will be scheduled in the same (total) order at all sites and then
committed.

• $C_{T_1} \cap C_{T_2} = \emptyset$:

(g) If there is no serialization order between $T_1$ and $T_2$, then they do not need
to be considered for equivalence.

(h) If there is a serialization order between $T_1$ and $T_2$, it can only be indirect.
Assume that in $N$: $T_1 \rightarrow \ldots \rightarrow T_i \rightarrow \ldots \rightarrow T_2$. Between each pair of
transactions in that sequence, there is a direct conflict. Thus, for each pair, the
above cases apply and all sites order the pair in the same way. From here it
follows that $T_1$ and $T_2$ are also ordered in the same way at all sites. □

Theorem 4 (1CPSR in REORDERING). The Reordering algorithm produces
1-copy-serializable histories.

Proof (theorem 4): From Lemma 5, all histories are conflict equivalent.
Moreover, they are all serializable. Thus, the global history is 1-copy-serializable. □



6.5 Liveness of REORDERING 

Theorem 5 (Liveness in REORDERING). Each to-delivered transaction
$T_i$ eventually commits in the absence of catastrophic failures.

Proof (theorem 5): The proof is by induction.

Induction Basis: Let $T_i$ be the first to-delivered transaction. Upon to-delivery,
each remote site will place $T_i$ at the head of all its queues. At the local node,
there might be some reordered transactions before $T_i$; hence, $T_i$ will be their
serializer. All these transactions can be executed and committed, so that $T_i$ will
eventually be executed and committed. Remote sites will apply the updates of
the reordered transactions and $T_i$ in FIFO order and will also commit $T_i$.

Induction Hypothesis: The theorem holds for the to-delivered transactions with
positions $n \le k$, for some $k \ge 1$, in the definitive total order, i.e., all transactions
that have at most $k - 1$ preceding transactions will eventually commit.

Induction Step: Assume that transaction $T_i$ is at position $n = k + 1$ in the
definitive total order when it is to-delivered. There are two cases:

a) $T_i$ is reordered. This means there is a serializer transaction $T_j$ with a
position $n \le k$ in the total order and $T_i$ is ordered before $T_j$. Since, according
to the induction hypothesis, $T_j$ commits and $T_i$ is executed and committed before
$T_j$ at all sites, the theorem holds.

b) $T_i$ is not a reordered transaction. $T_i$ will be rescheduled after any
committable transaction and before any pending transaction. There exist two types of
committable transactions rescheduled before $T_i$.

i. Not reordered transactions: They have a position $n \le k$ and will therefore
commit and be removed from the queues according to the induction hypothesis.

ii. Reordered transactions: Each reordered transaction that is serialized by a
transaction $T_k \neq T_i$ will commit before $T_k$, and $T_k$ will commit according to the
previous point (i). All transactions $T_j \in \mathcal{T}_{T_i}$ (i.e., $T_i$ is the serializer) are ordered
directly before $T_i$ in the queues (Lemma 3). Let $T_k$ be the first not reordered
transaction before this set of reordered transactions. $T_k$ will eventually commit
according to the previous point (i), and therefore also all transactions in $\mathcal{T}_{T_i}$ and
$T_i$ itself.

Failures lead to master reassignment but do not introduce cases different from
the ones above. □



6.6 Consistency of REORDERING 

Again, let $\mathcal{T}$ be the subset of transactions to-delivered to a node before it failed.

Theorem 6 (Consistency of failed sites with REORDERING). All
transactions, $T_i \in \mathcal{T}$, that are committed at a failed node $N$ are committed at all
available nodes. Moreover, the committed projection of the history in $N$ is conflict
equivalent to the committed projection of the history of any of the available
nodes when this history is restricted to the transactions in $\mathcal{T}$.

Proof (theorem 6): Since both transaction and commit messages are sent with
uniform reliable multicast, all transactions and their commit messages in $\mathcal{T}$ have
been to-delivered to all available sites and can therefore commit at all sites. The
equivalence of histories follows directly from Lemma 5. □

7 Conclusions 

In this paper, we have proposed two replication protocols for cluster based appli- 
cations. These protocols solve the scalability problem of existing solutions and 
minimize the number of aborted transactions. We are currently implementing 
and experimentally evaluating the protocols and, as part of future work, we will 
deploy a web farm with a replicated database built upon these protocols. For 
this purpose we will use TransLib [5], a group-based TP-monitor.

Abstract. We present an algorithm, called Disk Paxos, for implementing
a reliable distributed system with a network of processors and disks.
Like the original Paxos algorithm, Disk Paxos maintains consistency in
the presence of arbitrary non-Byzantine faults. Progress can be guaranteed
as long as a majority of the disks are available, even if all processors
but one have failed.



1 Introduction 

Fault tolerance requires redundant components. Maintaining consistency in the 
event of a system partition makes it impossible for a two-component system to 
make progress if either component fails. There are innumerable fault-tolerant 
algorithms for implementing distributed systems, but all that we know of equate 
component with processor. But there are other types of components that one 
might replicate instead. In particular, modern networks can now include disk 
drives as independent components. Because commodity disks are cheaper than 
computers, it is attractive to use them as the replicated components for achiev- 
ing fault tolerance. Commodity disks differ from processors in that they are 
not programmable, so we can’t just substitute disks for processors in existing 
algorithms. 

We present here an algorithm called Disk Paxos for implementing an arbi- 
trary fault-tolerant system with a network of processors and disks. It maintains 
consistency in the event of any number of non-Byzantine failures. That is, the 
algorithm tolerates faulty processors that pause for arbitrarily long periods, fail 
completely, and possibly restart; and it tolerates lost and delayed messages. Disk 
Paxos guarantees progress if the system is stable and there is at least one non- 
faulty processor that can read and write a majority of the disks. Stability means 
that each processor is either nonfaulty or has failed completely, and nonfaulty 
processors can access nonfaulty disks. 

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 330-344, 2000.

© Springer-Verlag Berlin Heidelberg 2000

Disk Paxos is a variant of the classic Paxos algorithm [3, 10, 12], a simple,
efficient algorithm that has been used in practical distributed systems [13, 16].
Classic Paxos can be viewed as an implementation of Disk Paxos in which there is 
one disk per processor, and a disk can be accessed directly only by its processor. 

In the next section, we recall how to reduce the problem of implementing 
an arbitrary distributed system to the consensus problem. Section 3 informally 
describes Disk Synod, the consensus algorithm used by Disk Paxos. It includes 
a sketch of an incomplete correctness proof and explains the relation between 
Disk Synod and the Synod protocol of classic Paxos. Section 4 briefly discusses 
some implementation details and contains the conventional concluding remarks. 
An appendix gives formal specifications of the consensus problem and the Disk 
Synod algorithm. Further discussion of the specifications and a sketch of a rig- 
orous correctness proof appear in [5] . 

2 The State-Machine Approach 

The state-machine approach [6, 14] is a general method for implementing an
arbitrary distributed system. The system is designed as a deterministic state
machine that executes a sequence of commands, and a consensus algorithm
ensures that, for each n, all processors agree on the nth command. This reduces
the problem of building an arbitrary system to solving the consensus problem.
In the consensus problem, each processor p starts with an input value input[p],
and all processors output the same value, which equals input[p] for some p. A
solution should be:

Consistent All values output are the same. 

Nonblocking If the system is stable and a nonfaulty processor can commu- 
nicate with a majority of disks, then the processor will eventually output a 
value. 
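
The reduction described above can be illustrated with a toy deterministic state machine. The sketch below is not part of Disk Paxos: the key-value machine, its command format, and the function `replay` are hypothetical, and the agreed command sequence is simply given as a list instead of being chosen by a consensus algorithm.

```python
class KeyValueMachine:
    """Hypothetical deterministic state machine: a tiny key-value store."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, value = command
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown command {op!r}")


def replay(agreed_commands):
    """Run one replica over the agreed sequence; item n is the nth command."""
    machine = KeyValueMachine()
    for command in agreed_commands:
        machine.apply(command)
    return machine.state


log = [("put", "x", 1), ("put", "y", 2), ("delete", "x", None)]
replica_a = replay(log)
replica_b = replay(list(log))
```

Because the machine is deterministic, any two replicas that apply the same agreed sequence end in the same state, which is why agreeing on each command suffices.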

It has long been known that a consistent, nonblocking consensus algorithm
requires a three-phase commit protocol [15], with voting, prepare to commit, and
commit phases. Nonblocking algorithms that use fewer phases don't guarantee
consistency. For example, the group communication algorithms of Isis [2] permit 
two processors belonging to the current group to disagree on whether a message 
was broadcast in a previous group to which they both belonged. This algorithm 
cannot, by itself, guarantee consistency because disagreement about whether a 
message had been broadcast can result in disagreement about the output value. 

The classic Paxos algorithm [3, 10, 12] achieves its efficiency by using a three- 
phase commit protocol, called the Synod algorithm, in which the value to be 
committed is not chosen until the second phase. When a new leader is elected, it 
executes the first phase just once for the entire sequence of consensus algorithms 
performed for all later system commands. Only the last two phases are performed 
separately for each individual command. 

In the Disk Synod algorithm, the consensus algorithm used by Disk Paxos,
each processor has an assigned block on each disk. The algorithm has two phases.
In each phase, a processor writes to its own block and reads each other processor's
block on a majority of the disks.¹ Only the last phase needs to be executed anew
for each command. So, in the normal steady-state case, a leader chooses a
state-machine command by executing a single write to each of its blocks and a single
read of every other processor's blocks.

The classic result of Fischer, Lynch, and Paterson [4] implies that a purely
asynchronous nonblocking consensus algorithm is impossible. So, real-time clocks 
must be introduced. The typical industry approach is to use an ad hoc algorithm 
based on timeouts to elect a leader, and then have the leader choose the output. 
It is easy to devise a leader-election algorithm that works when the system is 
stable, which means that it works most of the time. It is very hard to make one 
that always works correctly even when the system is unstable. Both classic Paxos 
and Disk Paxos also assume a real-time algorithm for electing a leader. However, 
the leader is used only to ensure progress. Consistency is maintained even if 
there are multiple leaders. Thus, if the leader-election algorithm fails because 
the network is unstable, the system can fail to make progress; it cannot become 
inconsistent. The system will again make progress when it becomes stable and 
a single leader is elected. 



3 An Informal Description of Disk Synod 

We now informally describe the Disk Synod algorithm and explain why it works. 
(A formal specification appears in the appendix.) We also discuss its relation to 
classic Paxos's Synod Protocol. Remember that, in normal operation, only a
single leader will be executing the algorithm. The other processors do nothing; 
they simply wait for the leader to inform them of the outcome. However, the 
algorithm must preserve consistency even when it is executed by multiple proces- 
sors, or when the leader fails before announcing the outcome, and a new leader 
is chosen. 



3.1 The Algorithm 

We assume that each processor p starts with an input value input[p].² As in
Paxos's Synod algorithm, a processor executes a sequence of numbered ballots,
with increasing ballot numbers. A ballot number is a positive integer, and
different processors use different ballot numbers. For example, if the processors are
numbered from 1 through N, then processor i could use ballot numbers i, i + N,
i + 2N, etc. A ballot has two phases:

Phase 1 Choose a value v. 

Phase 2 Try to commit v. 
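
The ballot numbering scheme mentioned above (processor i of N uses i, i + N, i + 2N, and so on) can be sketched as follows. The function names are hypothetical, and `next_ballot` is an added convenience for picking a higher ballot after an abort, not something the text prescribes.

```python
from itertools import count, islice


def ballot_numbers(i, n):
    """Infinite increasing sequence i, i + n, i + 2n, ... for processor i."""
    if not 1 <= i <= n:
        raise ValueError("processor ids are assumed to run from 1 through n")
    return count(i, n)


def next_ballot(i, n, seen):
    """Smallest ballot number owned by processor i that exceeds `seen`."""
    steps = max(0, (seen - i) // n + 1)
    return i + steps * n


first_three = list(islice(ballot_numbers(2, 3), 3))
retry = next_ballot(2, 3, seen=7)
```

Since each processor draws from a different residue class modulo N, no two processors can ever use the same ballot number.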

¹ There is also an extra phase that a processor executes when recovering from a failure.
² If processor p fails, it can restart with a new value of input[p].







In either phase, a processor aborts its ballot if it learns that another processor
has begun a higher-numbered ballot. In that case, the processor may then choose
a higher ballot number and start a new ballot. If the processor completes phase 2
without aborting (that is, without learning of a higher-numbered ballot), then
value v is committed and the processor can output it. Since a processor does not
choose the value to be committed until phase 2, phase 1 can be performed once 
for any number of separate instances of the algorithm. 

To ensure consistency, we must guarantee that two different values cannot be 
successfully committed — either by different processors or by the same processor 
in two different ballots. To ensure that the algorithm is nonblocking, we must 
guarantee that, if there is only a single processor p executing it, then p will 
eventually commit a value. 

In practice, when a processor successfully commits a value, it will write on 
its disk block that the value was committed and also broadcast that fact to the 
other processors. If a processor learns that a value has been committed, it will 
abort its ballot and simply output the value. It is obvious that this optimization 
preserves correctness; we will not consider it further. 

To execute the algorithm, a processor p maintains a record dblock[p] con- 
taining the following three components: 

mbal The current ballot number. 

bal The largest ballot number for which p reached phase 2. 

inp The value p tried to commit in ballot number bal. 

Initially, bal equals 0, inp equals a special value NotAnInput that is not a possible
input value, and mbal is any ballot number. We let disk[d][p] be the block on
disk d in which processor p writes dblock[p]. We assume that reading and writing
a block are atomic operations.
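
The record and disk layout just described can be sketched in Python as follows. This is only an illustrative sketch: the names DiskBlock, make_disks, ballots, and NOT_AN_INPUT are assumptions introduced here, and the initial mbal is taken to be 0 (the text allows any ballot number).

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterator

# Hypothetical sentinel standing in for the special value NotAnInput.
NOT_AN_INPUT = object()


@dataclass(frozen=True)
class DiskBlock:
    mbal: int = 0                  # current ballot number (assumed to start at 0)
    bal: int = 0                   # largest ballot for which phase 2 was reached
    inp: Hashable = NOT_AN_INPUT   # value tried in ballot number bal


def make_disks(num_disks: int, num_procs: int) -> Dict[int, Dict[int, DiskBlock]]:
    # disks[d][p] is the block on disk d reserved for processor p (numbered 1..N).
    return {d: {p: DiskBlock() for p in range(1, num_procs + 1)}
            for d in range(num_disks)}


def ballots(i: int, n: int) -> Iterator[int]:
    # Ballot numbers i, i + N, i + 2N, ... for processor i out of N processors.
    b = i
    while True:
        yield b
        b += n
```

Because each processor draws ballot numbers from its own residue class modulo N, two processors can never use the same ballot number.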

Processor p executes phase 1 or 2 of a ballot as follows. For each disk d, it
tries first to write dblock[p] to disk[d][p] and then to read disk[d][q] for all other
processors q. It aborts the ballot if, for any d and q, it finds disk[d][q].mbal >
dblock[p].mbal. The phase completes when p has written and read a majority
of the disks, without reading any block whose mbal component is greater
than dblock[p].mbal. When it completes phase 1, p chooses a new value of
dblock[p].inp, sets dblock[p].bal to dblock[p].mbal (its current ballot number),
and begins phase 2. When it completes phase 2, p has committed dblock[p].inp.
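
A single phase, as described above, might be sketched like this for disks held in memory. The sketch is sequential and ignores concurrency and disk failures; Block, Aborted, run_phase, and the reachable parameter are hypothetical names, and None stands in for NotAnInput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    mbal: int = 0
    bal: int = 0
    inp: object = None   # None stands in for NotAnInput (assumption)


class Aborted(Exception):
    """Raised when a block with a larger mbal is read."""


def run_phase(disks, p, dblock, reachable):
    # For each reachable disk d: write dblock to disks[d][p] first, then read
    # disks[d][q] for every other processor q. Return the blocks read once a
    # strict majority of all disks has been written and read.
    seen, done = [], 0
    for d in reachable:
        disks[d][p] = dblock
        for q, blk in disks[d].items():
            if q == p:
                continue
            if blk.mbal > dblock.mbal:
                raise Aborted(f"disk {d}: processor {q} has mbal {blk.mbal}")
            seen.append(blk)
        done += 1
        if 2 * done > len(disks):
            return seen
    raise RuntimeError("could not write and read a majority of the disks")
```

A caller would run this once for phase 1, update inp and bal, and run it again for phase 2; an Aborted exception corresponds to abandoning the ballot.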

To complete our description of the two phases, we now describe how processor
p chooses the value of dblock[p].inp that it tries to commit in phase 2. Let
blocksSeen be the set consisting of dblock[p] and all the records disk[d][q] read
by p in phase 1. Let nonInitBlks be the subset of blocksSeen consisting of those
records whose inp field is not NotAnInput. If nonInitBlks is empty, then p sets
dblock[p].inp to its own input value input[p]. Otherwise, it sets dblock[p].inp to
bk.inp for some record bk in nonInitBlks having the largest value of bk.bal.
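
The value-selection rule can be written as a small function. This is a sketch under assumed names (Block, choose_inp, and None as the NotAnInput sentinel), not the authors' code.

```python
from collections import namedtuple

Block = namedtuple("Block", ["mbal", "bal", "inp"])
NOT_AN_INPUT = None  # assumed sentinel for NotAnInput


def choose_inp(own_block, blocks_read, own_input):
    # blocksSeen: the processor's own block plus every block read in phase 1.
    blocks_seen = [own_block, *blocks_read]
    # nonInitBlks: blocks whose inp field is not NotAnInput.
    non_init = [bk for bk in blocks_seen if bk.inp is not NOT_AN_INPUT]
    if not non_init:
        return own_input
    # Any block with the largest bal will do; max picks one of them.
    return max(non_init, key=lambda bk: bk.bal).inp
```

For example, a processor that sees only initial blocks proposes its own input, while one that sees blocks with bal values 2 and 1 adopts the inp of the block with bal 2.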

Finally, we describe what processor p does when it recovers from a failure. 
In this case, p reads its own block disk[d][p] from a majority of disks d. It then 
sets dblock[p] to any block bk it read having the maximum value of bk.mbal, and
it starts a new ballot by increasing dblock[p].mbal and beginning phase 1. 
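
The recovery step could look roughly as follows. The names recover, readable, and next_ballot are hypothetical; next_ballot is assumed to return one of the processor's own ballot numbers larger than its argument.

```python
from collections import namedtuple

Block = namedtuple("Block", ["mbal", "bal", "inp"])


def recover(disks, p, readable, next_ballot):
    # Read p's own block from a majority of the disks (readable lists distinct disks).
    if 2 * len(readable) <= len(disks):
        raise RuntimeError("recovery needs a majority of the disks")
    own_blocks = [disks[d][p] for d in readable]
    # Take any block with the maximum mbal ...
    best = max(own_blocks, key=lambda bk: bk.mbal)
    # ... and start a new ballot with a larger ballot number.
    new_mbal = next_ballot(best.mbal)
    if new_mbal <= best.mbal:
        raise ValueError("next_ballot must return a larger ballot number")
    return best._replace(mbal=new_mbal)
```

The returned block becomes the new dblock[p], and the processor then begins phase 1.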

3.2 Why the Algorithm Works

Suppose processor p can read and write a majority of the disks, and all processors 
other than p stop executing the algorithm. In this case, p will eventually choose 
a ballot number greater than the mbal field of all blocks on the disks it can read, 
and its ballot will succeed. Hence, this algorithm is nonblocking, in the sense 
explained above. 

We now explain, intuitively, why the Disk Synod algorithm maintains consistency.
First, we consider the following shared-memory version of the algorithm
that uses single-writer, multiple-reader regular registers.^ Instead of writing to
disk, processor p writes dblock[p] to a shared register; and it reads the values of 
dblock[q] for other processors q from the registers. A processor chooses its bal 
and inp values for phase 2 the same way as before, except that it reads just 
one dblock value for each other processor, rather than one from each disk. We 
assume for now that processors do not fail. 

To prove consistency, we must show that, for any processors p and q, if p
finishes phase 2 and commits the value vp and q finishes phase 2 and commits the
value vq, then vp = vq. Let bp and bq be the respective ballot numbers on which
these values are committed. Without loss of generality, we can assume bp ≤ bq.
Moreover, using induction on bq, we can assume that, if any processor r starts
phase 2 for a ballot br with bp ≤ br < bq, then it does so with dblock[r].inp = vp.

When reading in phase 2, p cannot have seen the value of dblock[q].mbal
written by q in phase 1; otherwise, p would have aborted. Hence p’s read of
dblock[q] in phase 2 did not follow q’s phase 1 write. Because reading follows
writing in each phase, this implies that q’s phase 1 read of dblock[p] must have
followed p’s phase 2 write. Hence, q read the current (final) value of dblock[p]
in phase 1, namely a record with bal field bp and inp field vp. Let bk be any other
block that q read in its phase 1. Since q did not abort, bq > bk.mbal. Since
bk.mbal ≥ bk.bal for any block bk, this implies bq > bk.bal. By the induction
assumption, we obtain that, if bk.bal ≥ bp, then bk.inp = vp. Since this is true
for all blocks bk read by q in phase 1, and since q read the final value of dblock[p],
the algorithm implies that q must set dblock[q].inp to vp for phase 2, proving
that vp = vq.

To obtain the Disk Synod algorithm from the shared-memory version, we use
a technique due to Attiya, Bar-Noy, and Dolev [1] to implement a single-writer,
multiple-reader register with a network of disks. To write a value, a processor
writes the value together with a version number to a majority of the disks. To 
read, a processor reads a majority of the disks and takes the value with the 
largest version number. Since two majorities of disks contain at least one disk 
in common, a read must obtain either the last version for which the write was 
completed, or else a later version. Hence, this implements a regular register. 
With this technique, we transform the shared-memory version into a version for 
a network of processors and disks. 
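
The majority-based register just described can be illustrated with a short in-memory sketch. It models a single writer, sequential operations, and no partial writes; the class name DiskRegister and its methods are assumptions made for illustration.

```python
def majority(n):
    # Smallest number of disks that forms a strict majority of n disks.
    return n // 2 + 1


class DiskRegister:
    """Single-writer register stored as (version, value) pairs on several disks."""

    def __init__(self, num_disks):
        self.disks = [(0, None)] * num_disks
        self.version = 0  # kept by the single writer

    def write(self, value, reachable):
        if len(set(reachable)) < majority(len(self.disks)):
            raise RuntimeError("write needs a majority of the disks")
        self.version += 1
        for d in reachable:
            self.disks[d] = (self.version, value)

    def read(self, reachable):
        if len(set(reachable)) < majority(len(self.disks)):
            raise RuntimeError("read needs a majority of the disks")
        # Take the value carrying the largest version number among the disks read.
        return max((self.disks[d] for d in reachable), key=lambda vv: vv[0])[1]
```

Because any two majorities share a disk, a read that starts after a write has completed always sees that write's version or a later one.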

^ A regular register is one in which a read that does not overlap a write returns the 
register’s current value, and a read that overlaps one or more writes returns either 
the register’s previous value or one of the values being written [7].

The actual Disk Synod algorithm simplifies the algorithm obtained by this
transformation in two ways. First, the version number is not needed. The mbal 
and bal values play the role of a version number. Second, a processor p need 
not choose a single version of dblock[q] from among the ones it reads from disk. 
Because mbal and bal values do not decrease, earlier versions have no effect. 

So far, we have ignored processor failures. There is a trivial way to extend 
the shared-memory algorithm to allow processor failures. A processor recovers
by simply reading its dblock value from its register and starting a new ballot. A
processor that fails and recovers then behaves like one that may start a new ballot
at any time. We can show that this generalized version is also correct. However, in
the actual disk algorithm, a processor can fail while it is writing. This can leave 
its disk blocks in a state in which no value has been written to a majority of 
the disks. Such a state has no counterpart in the shared-memory version. There 
seems to be no easy way to derive the recovery procedure from a shared-memory 
algorithm. The proof of the complete Disk Synod algorithm, with failures, is 
much more complicated than the one for the simple shared-memory version. 
Trying to write the kind of behavioral proof given above for the simple algorithm 
leads to the kind of complicated, error-prone reasoning that we have learned to 
avoid. A sketch of a rigorous assertional proof is given in [5]. 

3.3 Deriving Classic Paxos from Disk Paxos 

In the usual view of a distributed fault-tolerant system, a processor performs 
actions and maintains its state in local memory, using stable storage to recover 
from failures. An alternative view is that a processor maintains the state of its 
stable storage, using local memory only to cache the contents of stable storage. 
Identifying disks with stable storage, a traditional distributed system is then 
a network of disks and processors in which each disk belongs to a separate 
processor; other processors can read a disk only by sending messages to its 
owner. 

Let us now consider how to implement Disk Synod on a network of processors 
that each has its own disk. To perform phase 1 or 2, a processor p would access a 
disk d by sending a message containing dblock[p] to disk d’s owner q. Processor 
q could write dblock[p] to disk[d][p], read disk[d][r] for all r ≠ p, and send the
values it read back to p. However, examining the Disk Synod algorithm reveals
that there’s no need to send back all that data. All p needs is (i) to know whether
its mbal field is larger than any other block’s mbal field and, if it is, (ii) the bal
and inp fields for the block having the maximum bal field. Hence, q need only
store on disk three values: the bal and inp fields for the block with maximum
bal field, and the maximum mbal field of all disk blocks. Of course, q would have
those values cached in its memory, so it would actually write to disk only if any
of those values changes.
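
One way to picture the reduced state kept by a disk owner q is the sketch below. It is an interpretation rather than the paper's protocol: OwnerState and handle are hypothetical names, None stands in for NotAnInput, and the comparison uses >= on the assumption that the stored maximum may already include the sender's own earlier block with the same ballot number.

```python
from dataclasses import dataclass


@dataclass
class OwnerState:
    max_mbal: int = 0    # maximum mbal field of all blocks received so far
    bal: int = 0         # bal field of the block with maximum bal
    inp: object = None   # inp field of that block (None stands in for NotAnInput)

    def handle(self, mbal, bal, inp):
        # Process a block sent by some processor p. Returns (ok, bal, inp):
        # ok says whether no larger mbal has been seen; bal and inp describe
        # the block with the maximum bal field.
        ok = mbal >= self.max_mbal
        # As in the disk version, the block is recorded regardless of ok.
        self.max_mbal = max(self.max_mbal, mbal)
        if bal > self.bal:
            self.bal, self.inp = bal, inp
        return ok, self.bal, self.inp
```

A reply with ok equal to False tells the sender that it must abort its ballot.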

A processor must also read its own disk blocks to recover from a failure. 
Suppose we implement Disk Synod by letting p write to its own disk before 
sending messages to any other processor. This ensures that its own disk has the 
maximum value of disk[d][p].mbal among all the disks d. Hence, to restart after 
a failure, p need only read its block from its own disk. In addition to the mbal,
bal, and inp values mentioned above, p would also keep the value of dblock[p] on
its disk.

We can now compare this algorithm with classic Paxos’s Synod protocol [10]. 
The mbal, bal, and inp components of dblock[p] are just lastTried[p], nextBal[p],
and prevVote[p] of the Synod Protocol. Phase 1 of the Disk Synod algorithm
corresponds to sending the NextBallot message and receiving the LastVote re- 
sponses in the Synod Protocol. Phase 2 corresponds to sending the BeginBallot 
and receiving the Voted replies.^ The Synod Protocol’s Success message corre- 
sponds to the optimization mentioned above of recording on disk that a value 
has been committed. 

This version of the Disk Synod algorithm differs from the Synod Protocol 
in two ways. First, the Synod Protocol’s NextBallot message contains only the 
mbal value; it does not contain bal and inp values. To obtain the Synod Protocol, 
we would have to modify the Disk Synod algorithm so that, in phase 1, it writes 
only the mbal field of its disk block and leaves the bal and inp fields unchanged. 
The algorithm remains correct, with essentially the same proof, under this mod- 
ification. However, the modification makes the algorithm harder to implement 
with real disks. 

The second difference between this version of the Disk Synod algorithm and 
the Synod Protocol is in the restart procedure. A disk contains only the
aforementioned mbal, bal, and inp values. It does not contain a separate copy of its
owner’s dblock value. The Synod Protocol can be obtained from the following
variant of the Disk Synod algorithm. Let bk be the block disk[d][p] with maximum
bal field read by processor p in the restart procedure. Processor p can begin
phase 1 with bal and inp values obtained from any disk block bk', written by
any processor, such that bk'.bal ≥ bk.bal. It can be shown that the Disk Synod
algorithm remains correct under this modification too.

4 Conclusion 

4.1 Implementation Considerations 

Implicit in our description of the Disk Synod algorithm are certain assumptions 
about how reading and writing are implemented when disks are accessed over a 
network. If operations sent to the disks may be lost, a processor p must receive 
an acknowledgment from disk d that its write to disk[d][p] succeeded. This may 
require p to explicitly read its disk block after writing it. If operations may 
arrive at the disk in a different order than they were sent, p will have to wait 
for the acknowledgment that its write to disk d succeeded before reading other 
processors’ blocks from d. Moreover, some mechanism is needed to ensure that 
a write from an earlier ballot does not arrive after a write from a later one,
overwriting the later value with the earlier one. How this is achieved will be
system dependent. (It is impossible to implement any fault-tolerant system if
writes to disk can linger arbitrarily long in the network and cause later values
to be overwritten.)

^ In the Synod Protocol, a processor q does not bother sending a response if p sends
it a disk block with a value of mbal smaller than one already on disk. Sending back
the maximum mbal value is an optimization mentioned in [10].

Recall that, in Disk Paxos, a sequence of instances of the Disk Synod algo- 
rithm is used to commit a sequence of commands. In a straightforward imple- 
mentation of Disk Paxos, processor p would write to its disk blocks the value of 
dblock[p] for the current instance of Disk Synod, plus the sequence of all com- 
mands that have already been committed. The sequence of all commands that 
have ever been committed is probably too large to fit on a single disk block. 
However, the complete sequence can be stored on multiple disk blocks. All that 
must be kept in the same disk block as dblock[p] is a pointer to the head of the 
queue. For most applications, it is not necessary to remember the entire sequence 
of commands [10, Section 3.3.2]. In many cases, all the data that must be kept 
will fit in a single disk block. 

In the application for which Disk Paxos was devised (a future Compaq prod- 
uct), the set of processors is not known in advance. Each disk contains a directory 
listing the processors and the locations of their disk blocks. Before reading a disk, 
a processor reads the disk’s directory. To write a disk’s directory, a processor 
must acquire a lock for that disk by executing a real-time mutual exclusion al- 
gorithm based on Fischer’s protocol [8]. A processor joins the system by adding 
itself to the directory on a majority of disks. 



4.2 Concluding Remarks 

We have presented Disk Paxos, an efficient implementation of the state machine 
approach in a system in which processors communicate by accessing ordinary 
(nonprogrammable) disks. In the normal case, the leader commits a command 
by writing its own block and reading every other processor’s block on a majority 
of the shared disks. This is clearly the minimal number of disk accesses needed. 

Disk Paxos was motivated by the recent development of the Storage Area Net- 
work (SAN) — an architecture consisting of a network of computers and disks in 
which all disks can be accessed by each computer. Commodity disks are cheaper 
than computers, so using redundant disks for fault tolerance is more economical 
than using redundant computers. Moreover, since disks do not run application- 
level programs, they are less likely to crash than computers. 

Because commodity disks are not programmable, we could not simply sub- 
stitute disks for processors in the classic Paxos algorithm. Instead we took the 
ideas of classic Paxos and transplanted them to the SAN environment. What 
we obtained is almost, but not quite, a generalization of classic Paxos. Indeed, 
when Disk Paxos is instantiated to a single disk, we obtain what may be called 
Shared-Memory Paxos. Algorithms for shared memory are usually more succinct
and clear than their message passing counterparts. Thus, Disk Paxos can be con- 
sidered yet another revisiting of classic Paxos that exposes its underlying ideas 
by removing the message-passing clutter. Perhaps other distributed algorithms 
can also be made more clear by recasting them in a shared-memory setting. 

Appendix

We now give precise specifications of the consensus problem solved by the Disk 
Synod algorithm and of the algorithm itself. The specifications are written in 
TLA+, a formal language that combines the temporal logic of actions (TLA) [9], 
set theory, and first-order logic with notation for making definitions and encap- 
sulating them in modules. These specifications have been debugged with the aid 
of the TLC model checker [17]. (However, errors may have been introduced by 
the manual process of translating from TLA+ to LaTeX.) TLA+ is described
in [11]; annotated versions of the specifications, with fuller explanations of the 
TLA+ constructs, appear in [5]. 

We feel that the algorithm’s nonblocking property is sufficiently obvious not 
to need a rigorous specification and proof, so we consider only consistency. We 
therefore do not specify any liveness properties, so we make very little use of 
temporal logic. 



The Specification of Consensus 

We assume that there are N processors, numbered 1 through N. Each processor 
p has two registers: an input register input[p] that initially equals some element 
of the set Inputs of possible input values, and an output register output[p] that 
initially equals a special value NotAnInput that is not an element of Inputs. 
Processor p chooses an output value by setting output[p]. It can also fail, which 
it does by setting input[p] to any value in Inputs and resetting output[p] to 
NotAnInput. The precise condition to be satisfied is that, if some processor p
ever sets output[p] to some value v, then

- v must be a value that is, or at one time was, the value of input[q] for some
processor q

- if any processor r (including p itself) later sets output[r] to some value w
other than NotAnInput, then w = v.
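
The two conditions above can be checked mechanically on a finite trace of events. The following is a minimal sketch, assuming a hypothetical event encoding in which ("fail", p, x) means that p fails and restarts with input x, and ("output", p, v) means that p sets output[p] to v.

```python
NOT_AN_INPUT = None  # assumed sentinel for NotAnInput


def check_trace(initial_inputs, events):
    # initial_inputs maps each processor to its initial input value.
    all_inputs = set(initial_inputs.values())   # every input chosen so far
    chosen = NOT_AN_INPUT                       # first output value chosen
    for kind, p, value in events:
        if kind == "fail":
            all_inputs.add(value)
        elif kind == "output":
            if value not in all_inputs:
                return False                    # not the input of any processor
            if chosen is not NOT_AN_INPUT and value != chosen:
                return False                    # two different outputs
            chosen = value
    return True
```

The local variables all_inputs and chosen play the same role as the two auxiliary variables introduced for the formal specification.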

We first define a specification ISpec that has two additional variables: allInput,
the set of all inputs chosen so far, and chosen, which is set to the first output
value chosen. The actual specification SynodSpec is obtained from ISpec by hiding
allInput and chosen. Hiding in TLA is expressed by the temporal existential
quantifier \EE. To formally define SynodSpec in TLA+, we define ISpec in a
submodule that is then instantiated. However, the reader not familiar with TLA+
can ignore these details and pretend that SynodSpec is simply defined to equal
\EE allInput, chosen : ISpec.

The reader unfamiliar with TLA can consider the specification ISpec to consist
of two parts: the initial predicate IInit and the next-state action INext,
which is a predicate relating the new (primed) state with the old (unprimed)
state.

Most of the TLA+ notation used in the definitions should be self-evident,
except for the following function constructs: [x \in S |-> g(x)] is the function f
with domain S such that f[x] = g(x) for all x in S; [S -> T] is the set of all
functions with domain S and range a subset of T; and [f EXCEPT ![x] = e]
is the function h that is the same as f except that h[x] = e. TLA+ allows
conjunctions and disjunctions to be written as bulleted lists, with indentation
used to eliminate parentheses.
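
For readers more used to programming languages, the three function constructs have rough Python analogs, with a TLA+ function modeled as a dict. The helper names below are invented for this illustration.

```python
from itertools import product


def make_function(domain, g):
    # Analog of [x \in S |-> g(x)]: the function with domain S mapping x to g(x).
    return {x: g(x) for x in domain}


def function_set(domain, codomain):
    # Analog of [S -> T]: all functions with domain S and values drawn from T.
    domain = list(domain)
    return [dict(zip(domain, values))
            for values in product(list(codomain), repeat=len(domain))]


def except_(f, x, e):
    # Analog of [f EXCEPT ![x] = e]: same as f, except that x maps to e.
    # The original f is left unchanged; if x is outside the domain, the result equals f.
    return {y: (e if y == x else f[y]) for y in f}
```

As in TLA+, these operations build new functions rather than modifying existing ones.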

The specification is contained in the following module named SynodSpec. The
module begins with an EXTENDS statement that imports the Naturals module,
which defines the set Nat of natural numbers and the usual arithmetic operations.
The Naturals module also defines i .. j to be the set of natural numbers
from i through j.

--------------------------- MODULE SynodSpec ---------------------------

EXTENDS Naturals

CONSTANT N, Inputs
ASSUME (N \in Nat) /\ (N > 0)

Proc == 1 .. N

NotAnInput == CHOOSE c : c \notin Inputs
VARIABLES input, output

----------------------------- MODULE Inner -----------------------------

VARIABLES allInput, chosen

IInit == /\ input \in [Proc -> Inputs]
         /\ output = [p \in Proc |-> NotAnInput]
         /\ chosen = NotAnInput
         /\ allInput = {input[p] : p \in Proc}

Choose(p) ==
  /\ output[p] = NotAnInput
  /\ IF chosen = NotAnInput
       THEN \E ip \in allInput : /\ chosen' = ip
                                 /\ output' = [output EXCEPT ![p] = ip]
       ELSE /\ output' = [output EXCEPT ![p] = chosen]
            /\ UNCHANGED chosen
  /\ UNCHANGED <<input, allInput>>

Fail(p) == /\ output' = [output EXCEPT ![p] = NotAnInput]
           /\ \E ip \in Inputs : /\ input' = [input EXCEPT ![p] = ip]
                                 /\ allInput' = allInput \cup {ip}
           /\ UNCHANGED chosen

INext == \E p \in Proc : Choose(p) \/ Fail(p)

ISpec == IInit /\ [][INext]_<<input, output, chosen, allInput>>

========================================================================

IS(chosen, allInput) == INSTANCE Inner

SynodSpec == \EE chosen, allInput : IS(chosen, allInput)!ISpec

========================================================================

The Disk Synod Algorithm

The Disk Synod algorithm’s specification appears in module DiskSynod, which
uses an EXTENDS statement to import all the declarations and definitions from
the SynodSpec module. The specification introduces three new constant parameters:
an operator Ballot such that Ballot(p) is the set of ballot numbers that
processor p can use; a set Disk of disks; and a predicate IsMajority, which
generalizes the notion of a majority. The specification asserts the assumptions that
different processors have disjoint sets of ballot numbers, and that, for any subsets
S and T of Disk, if IsMajority(S) and IsMajority(T) are true, then S and
T are not disjoint.
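
For a small set of disks, the intersection assumption on IsMajority can be checked by brute force. The sketch below uses hypothetical helper names and enumerates all subsets, so it is only practical for a handful of disks.

```python
from itertools import combinations


def subsets(items):
    # Yield every subset of items as a frozenset.
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def majorities_intersect(disks, is_majority):
    # True if every two subsets accepted by is_majority share at least one disk.
    majors = [s for s in subsets(disks) if is_majority(s)]
    return all(s & t for s in majors for t in majors)
```

With four disks, the usual strict-majority predicate passes the check, whereas a predicate that accepts exactly half of the disks fails it, because two disjoint halves exist.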

The specification uses the following variables: input and output are imported
from the SynodSpec module; dblock and disk were explained in the informal
description of the algorithm; phase[p] is the current phase of processor p, which
is set to 0 when p fails and to 3 when p chooses its output; disksWritten[p]
is the set of disks that processor p has written during its current phase; and
blocksRead[p][d] is the set of values p has read from disk d during its current
phase.

Some additional TLA+ notation is introduced in the specification. TLA+
has the following record constructs: [f1 |-> v1, ..., fn |-> vn] is the record r
with fields f1, ..., fn such that r.fi = vi for each i; and [f1 : S1, ..., fn : Sn]
is the set of all such records with vi an element of the set Si, for each i. The
EXCEPT construct has the following extensions: in [f EXCEPT ![x] = e], an @ in
expression e denotes f[x]; the EXCEPT part can have multiple “replacements”
separated by commas; and the construct generalizes to functions of functions in
the obvious way (for example, [f EXCEPT ![x][y] = e]). In TLA+, SUBSET S is
the set of all subsets of S, and UNION S is the union of all the elements of S.

The algorithm’s specification is formula DiskSynodSpec, but the reader
unfamiliar with TLA can consider the specification to be the initial predicate Init
and the next-state action Next. The module ends by asserting the correctness of
the algorithm, expressed in TLA by the statement that the algorithm’s specification
implies its correctness condition. On first reading, we recommend jumping
from the definition of Init to the definition of Next, and then reading backwards
to see what is defined in terms of what.

--------------------------- MODULE DiskSynod ---------------------------

EXTENDS SynodSpec

CONSTANTS Ballot(_), Disk, IsMajority(_)

ASSUME /\ \A p \in Proc : /\ Ballot(p) \subseteq {n \in Nat : n > 0}
                          /\ \A q \in Proc \ {p} : Ballot(p) \cap Ballot(q) = {}
       /\ \A S, T \in SUBSET Disk :
            IsMajority(S) /\ IsMajority(T) => (S \cap T # {})

DiskBlock == [mbal : (UNION {Ballot(p) : p \in Proc}) \cup {0},
              bal  : (UNION {Ballot(p) : p \in Proc}) \cup {0},
              inp  : Inputs \cup {NotAnInput}]

InitDB == [mbal |-> 0, bal |-> 0, inp |-> NotAnInput]

VARIABLES disk, dblock, phase, disksWritten, blocksRead

vars == <<input, output, disk, phase, dblock, disksWritten, blocksRead>>

Init == /\ input \in [Proc -> Inputs]
        /\ output = [p \in Proc |-> NotAnInput]
        /\ disk = [d \in Disk |-> [p \in Proc |-> InitDB]]
        /\ phase = [p \in Proc |-> 0]
        /\ dblock = [p \in Proc |-> InitDB]
        /\ disksWritten = [p \in Proc |-> {}]
        /\ blocksRead = [p \in Proc |-> [d \in Disk |-> {}]]

hasRead(p, d, q) == \E br \in blocksRead[p][d] : br.proc = q

allBlocksRead(p) == LET allRdBlks == UNION {blocksRead[p][d] : d \in Disk}
                    IN  {br.block : br \in allRdBlks}

InitializePhase(p) ==
  /\ disksWritten' = [disksWritten EXCEPT ![p] = {}]
  /\ blocksRead' = [blocksRead EXCEPT ![p] = [d \in Disk |-> {}]]

StartBallot(p) ==
  /\ phase[p] \in {1, 2}
  /\ phase' = [phase EXCEPT ![p] = 1]
  /\ \E b \in Ballot(p) : /\ b > dblock[p].mbal
                          /\ dblock' = [dblock EXCEPT ![p].mbal = b]
  /\ InitializePhase(p)
  /\ UNCHANGED <<input, output, disk>>

Phase1or2Write(p, d) ==
  /\ phase[p] \in {1, 2}
  /\ disk' = [disk EXCEPT ![d][p] = dblock[p]]
  /\ disksWritten' = [disksWritten EXCEPT ![p] = @ \cup {d}]
  /\ UNCHANGED <<input, output, phase, dblock, blocksRead>>

Phase1or2Read(p, d, q) ==
  /\ d \in disksWritten[p]
  /\ IF disk[d][q].mbal < dblock[p].mbal
       THEN /\ blocksRead' =
                 [blocksRead EXCEPT
                    ![p][d] = @ \cup {[block |-> disk[d][q], proc |-> q]}]
            /\ UNCHANGED <<input, output, disk, phase, dblock, disksWritten>>
       ELSE StartBallot(p)

EndPhase1or2(p) ==
  /\ IsMajority({d \in disksWritten[p] :
                    \A q \in Proc \ {p} : hasRead(p, d, q)})
  /\ \/ /\ phase[p] = 1
        /\ dblock' =
             [dblock EXCEPT
                ![p].bal = dblock[p].mbal,
                ![p].inp =
                   LET blocksSeen == allBlocksRead(p) \cup {dblock[p]}
                       nonInitBlks ==
                          {bs \in blocksSeen : bs.inp # NotAnInput}
                       maxBlk == CHOOSE b \in nonInitBlks :
                                    \A c \in nonInitBlks : b.bal >= c.bal
                   IN  IF nonInitBlks = {} THEN input[p]
                                           ELSE maxBlk.inp]
        /\ UNCHANGED output
     \/ /\ phase[p] = 2
        /\ output' = [output EXCEPT ![p] = dblock[p].inp]
        /\ UNCHANGED dblock
  /\ phase' = [phase EXCEPT ![p] = @ + 1]
  /\ InitializePhase(p)
  /\ UNCHANGED <<input, disk>>

Fail(p) == /\ \E ip \in Inputs : input' = [input EXCEPT ![p] = ip]
           /\ phase' = [phase EXCEPT ![p] = 0]
           /\ dblock' = [dblock EXCEPT ![p] = InitDB]
           /\ output' = [output EXCEPT ![p] = NotAnInput]
           /\ InitializePhase(p)
           /\ UNCHANGED disk

Phase0Read(p, d) ==
  /\ phase[p] = 0
  /\ blocksRead' = [blocksRead EXCEPT
                      ![p][d] = @ \cup {[block |-> disk[d][p], proc |-> p]}]
  /\ UNCHANGED <<input, output, disk, phase, dblock, disksWritten>>

EndPhase0(p) ==
  /\ phase[p] = 0
  /\ IsMajority({d \in Disk : hasRead(p, d, p)})
  /\ \E b \in Ballot(p) :
       /\ \A r \in allBlocksRead(p) : b > r.mbal
       /\ dblock' = [dblock EXCEPT
                       ![p] = [(CHOOSE r \in allBlocksRead(p) :
                                  \A s \in allBlocksRead(p) : r.bal >= s.bal)
                               EXCEPT !.mbal = b]]
  /\ InitializePhase(p)
  /\ phase' = [phase EXCEPT ![p] = 1]
  /\ UNCHANGED <<input, output, disk>>

Next == \E p \in Proc :
          \/ StartBallot(p)
          \/ \E d \in Disk : \/ Phase0Read(p, d)
                             \/ Phase1or2Write(p, d)
                             \/ \E q \in Proc \ {p} : Phase1or2Read(p, d, q)
          \/ EndPhase1or2(p)
          \/ Fail(p)
          \/ EndPhase0(p)

DiskSynodSpec == Init /\ [][Next]_vars

THEOREM DiskSynodSpec => SynodSpec

========================================================================




Objects Shared by Byzantine Processes 

(Extended Abstract) 



Dahlia Malkhi, Michael Merritt, Michael Reiter, Gadi Taubenfeld



Abstract. Work to date on algorithms for message-passing systems has
explored a wide variety of types of faults, but corresponding work on
shared memory systems has usually assumed that only crash faults are
possible. In this work, we explore situations in which processes accessing
shared objects can fail arbitrarily (Byzantine faults).



1 Introduction 

1.1 Motivation 

It is commonly believed that message-passing systems are more difficult to pro- 
gram than systems that enable processes to communicate via shared memory. 
Many experimental and commercial processors provide direct support for shared 
memory abstractions, and increasing attention is being paid to implementing 
shared memory systems either in hardware or in software [Bel92, CG89, LH89,
TKB92]. Moreover, several middleware systems have been built to implement
shared memory abstractions in a message-passing environment. Of primary inter- 
est here are those that employ replication to provide fault-tolerant shared mem- 
ory abstractions, particularly those designed to mask the arbitrary (Byzantine) 
failure of processes implementing these abstractions (e.g., see [PG89, SE+92,
Rei96, KMM98, CL99, MR00]). These middleware systems generally guarantee
that shared objects themselves do not “fail”, and hence, that their integrity,
safety properties, and access interfaces and restrictions, are preserved. Neverthe- 
less, since legitimate clients accessing these objects might fail arbitrarily, they 
could corrupt the states of these objects in any way allowed by the object inter- 
faces. 

The question we address in this paper is: What power do shared memory 
objects have in such environments, in achieving any form of coordination among 
distributed processes that access these objects? This question is daunting, as 
Byzantine faulty processes can configure objects in any way allowed by the object 
interfaces. Thus, seemingly even very strong shared objects such as consensus
objects (which are universal for crash failures) might not be very useful in such a 
Byzantine environment, as faulty processes erroneously set their decision values. 
Surprisingly, although work to date on algorithms for message-passing systems 
has explored a wide variety of types of faults, corresponding work on shared 
memory systems has usually assumed that only crash faults are possible. Hence, 
our work is the first study of the power of objects shared by Byzantine processes. 



1.2 Summary of results 

We generalize the crash-fault model of shared memory to accommodate Byzan- 
tine faults. We show how a variety of techniques can be used to cooperate reliably 
in the presence of Byzantine faults, including bounds on the numbers of faulty 
processes, redundancy, access control lists that constrain faulty processes from 
accessing specific objects, and persistent objects (such as sticky bits [Plo89]) 
which cannot be overwritten. (We call objects that are not persistent, such as 
read/write registers, ephemeral.) We define a notion of shared object that is 
appropriate for this fault model, in which waiting between concurrent opera- 
tions is permitted. We explore the power of some specific shared objects in this 
model, proving both universality and impossibility results, and finally identify 
some non-trivial problems that can be solved in the presence of Byzantine faults 
even when using only ephemeral objects. 

The notions of consensus objects and sticky bits (a persistent, readable con- 
sensus object) in the Byzantine model, are formally defined in section 2. The 
results are: 

1. Universality result: Our main result shows that sticky bits can be used
to construct any other object (i.e., they are universal), assuming that the
number of (Byzantine) faults is bounded by (√n − 1)/2, where n is the total
number of processes.

To prove this result, a universal construction is presented that works as 
follows: First, sticky bits are used to construct a strong consensus object, 
i.e., a consensus object whose decision is a value proposed by some correct 
process. Equipped with strong consensus objects, we proceed to emulate any 
object. Our emulation borrows closely from Herlihy’s universal construction 
for crash faults [Her91], but differs in significant ways due to the need to 
cope with Byzantine failures. 

2. Bounds on faults: We observe that strong consensus objects, used to prove
the universality result, cannot be constructed when the possible number of
faults is t ≥ n/3. We observe that there exists a simple bounded-space
universal object assuming t < n/3, and a trivial unbounded-space universal
object assuming any number of t < n faults. We prove that when a majority
of the processes may be faulty, even weak consensus (i.e., a consensus object
whose decision is a value proposed by some correct or faulty process) cannot
be solved using any of the familiar non-sticky objects.

3. Constructions using ephemeral objects: While the universality result
involves sticky bits, the impossibility result shows that consensus cannot be
implemented using known objects that are not persistent. This raises the
question of what can be done with such ephemeral objects. We show how
various objects, such as k-set consensus and k-pairwise consensus, can be
implemented in a Byzantine environment using only atomic registers. Then
we show that familiar objects such as test&set, swap, compare&swap, and
read-modify-write can be used to implement election objects for any number
of processes and under any number of Byzantine faults.

1.3 Related work 

The power of various shared objects has been studied extensively in shared mem- 
ory environments where processes may fail benignly, and where every operation 
is wait-free: the operation is guaranteed to return within a finite number of steps.
Objects that can be used (together with atomic registers) to give a wait-free im- 
plementation of any other objects are called universal objects. Previous work on 
wait-free (and non-blocking) shared objects provided methods (called universal 
constructions) to transform sequential implementations of arbitrary shared ob- 
jects into wait-free concurrent implementations, assuming the existence of a uni- 
versal object [Her91, Plo89, JT92]. In particular, Plotkin showed that sticky bits 
are universal [Plo89], and independently, Herlihy proved that consensus objects 
are universal [Her91]. Herlihy also showed that shared objects can be classified 
according to their consensus number: that is, the maximum number of processes 
that can reach consensus using the object [Her91]. Attie investigates the power 
of shared objects accessed by Byzantine processes for achieving wait-free Byzan- 
tine agreement. He proves that strong agreement is impossible to achieve using 
resettable objects, i.e., objects that can be reset back to their initial setting, and 
constructs weak agreement using sticky bits [Att00].

Assume that at some point in a computation a shared register is set to some 
unexpected value. There are two complementary ways to explain how this may 
happen. One is to assume that the register’s value was set by a Byzantine process 
(as may happen in the model of this paper). The other way is to assume that 
the processes are correct but the register itself is faulty. The subject of memory 
faults (as opposed to process faults) has been investigated recently in several 
papers [AGMT95, JCT98]. These papers assume any number of process crash 
failures, but bound the number of faulty objects, whereas we bound the number 
of (Byzantine) faulty processes, but each might sabotage all the objects to which 
it has access. 

As described in the introduction, our focus on a shared memory Byzantine 
environment is driven by previous work on message-passing systems that emulate 
shared memory abstractions tolerant of Byzantine failures (e.g., [PG89, SE+92,
Rei96, KMM98, CL99, MR00]). Though these systems guarantee the correctness
of the emulated shared objects themselves, the question is what power do these 
objects provide to the correct processes that use them, in the face of corrupt 
processes accessing them. 

2 Model and definitions

Our model of computation consists of an asynchronous collection of n processes,
denoted p_1, ..., p_n, that communicate via shared objects. In any run, any process
may be either correct or faulty. Correct processes are constrained to obey their
specifications, while faulty processes can deviate arbitrarily from their
specifications (Byzantine failures), limited only by the assumptions stated below.
We denote by t the maximum number of faulty processes.

2.1 Shared objects with access control lists 

Each shared object presents a set of operations, e.g., x.op denotes operation op
on object x. For each such operation there is an associated access control
list (ACL) that names the processes allowed to invoke that operation. Each
operation execution begins with an invocation by a process in the operation's
ACL, and remains pending until a response is received by the invoking process.
The ACLs for two different operations on the same object can differ, as can the
ACLs for the same operation on two different objects. The ACLs for an object
do not change. For any operation x.op, we say that x is k-op if the ACL for
x.op lists k processes. We assume that a process not on the ACL for x.op cannot
invoke x.op, regardless of whether the process is correct or Byzantine (faulty).
That is, a (correct or faulty) process cannot access an object in any way except
via the operations for which it appears on the associated ACLs.

We note that the systems that motivated our study typically employ repli- 
cation to fault-tolerantly emulate shared memory abstractions. Therefore, ACLs 
can be implemented, e.g., by storing a copy of the ACL with each replica and 
filtering out disallowed operations before applying them to the replica. In this 
way, only operations allowed by the ACLs will be applied at correct replicas. 
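
As a rough illustration of the filtering step just described, the following Python sketch guards each operation of a single replica with its own ACL. The class and method names (`GuardedRegister`, `ACLViolation`, `check`) are hypothetical and chosen only for this example; the paper does not prescribe an implementation, and a real emulation would run this check at every replica.

```python
class ACLViolation(Exception):
    """Raised when a process invokes an operation it is not listed for."""


class GuardedObject:
    def __init__(self, acls):
        # acls maps an operation name to the set of process ids allowed to invoke it.
        # The ACLs are fixed at creation time, matching the assumption that they never change.
        self._acls = {op: frozenset(pids) for op, pids in acls.items()}

    def check(self, pid, op):
        # Filter out a disallowed operation before it is applied to the replica.
        if pid not in self._acls.get(op, frozenset()):
            raise ACLViolation(f"process {pid} may not invoke {op}")


class GuardedRegister(GuardedObject):
    def __init__(self, acls):
        super().__init__(acls)
        self._value = None  # None stands in for the initial value ⊥

    def write(self, pid, value):
        self.check(pid, "write")
        self._value = value

    def read(self, pid):
        self.check(pid, "read")
        return self._value


# A 1-write, 3-read register: only process 1 may write, processes 1..3 may read.
reg = GuardedRegister({"write": {1}, "read": {1, 2, 3}})
reg.write(1, 7)
```

In this sketch a write attempted by process 2 raises `ACLViolation` and leaves the stored value untouched, regardless of whether process 2 is correct or faulty.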

2.2 Fault tolerance and termination conditions 

In wait-free fault models, no bound is assumed on the number of potentially 
faulty processes. (Hence, no process may safely wait upon an action by another.) 
Any operation by a process p on a shared object must terminate, regardless of 
the concurrent actions of other processes. This model supports a natural and 
powerful notion of abstraction, which allows complex implementations to be 
viewed as atomic [HW90]. We extend this model in two ways: first, we make 
the more pessimistic assumption that process faults are Byzantine, and second, 
we make the more optimistic assumption that the number of faults is bounded 
by t, where t is less than the total number of processes, n. With the number
of failures bounded away from n, it becomes possible (and indeed necessary) 
for processes to coordinate with each other, using redundancy to overcome the 
Byzantine failures of their peers. This means that processes may need to wait 
for each other within individual operation implementations. 

An example that may provide some intuition is a sticky bit object emulated
by an ensemble of data servers, such that the value written to it must reflect a
value written by some correct process. A distributed emulation may implement
this object by having servers set the object's value only when t + 1 different
processes write the same value to it. Of course, this object will be useful only
when any value written to the object is indeed written by at least t + 1 processes,
and so an application must guarantee that t + 1 correct processes write identical
values. Below, we will see examples of such constructions.
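
The threshold rule in this example can be sketched in a few lines of Python. This is only a single-server illustration under assumed names (`ThresholdStickyBit`, `votes`); the replicated emulation the text has in mind would apply the same counting rule at each data server.

```python
class ThresholdStickyBit:
    """Sticky bit whose value is fixed only after t + 1 distinct processes
    have written the same value, so at least one of them is correct."""

    def __init__(self, t):
        self.t = t
        self.votes = {0: set(), 1: set()}  # value -> ids of processes that wrote it
        self.value = None                  # None stands in for ⊥ (not yet set)

    def write(self, pid, v):
        if v not in (0, 1):
            raise ValueError("a sticky bit holds 0 or 1")
        if self.value is not None:
            return                         # sticky: later writes have no effect
        self.votes[v].add(pid)             # repeated writes by one process count once
        if len(self.votes[v]) >= self.t + 1:
            self.value = v

    def read(self):
        return self.value


bit = ThresholdStickyBit(t=1)
bit.write(1, 1)   # one writer is not enough when t = 1
bit.write(2, 0)
bit.write(3, 1)   # second distinct writer of 1: the bit is now fixed to 1
bit.write(4, 0)   # ignored
```

With at most t faulty processes, no value can reach the threshold on faulty writes alone, which is the property the example relies on.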

Such an implementation is not wait-free, and raises the question of appropri- 
ate termination conditions for object invocations in a Byzantine environment. 
To address such concerns, we introduce two object properties, t-threshold and
t-resilience. The first captures termination conditions appropriate for objects
on which each client should invoke a single operation, and which function
correctly once enough correct processes access them. The second is appropriate
when processes perform multiple operations on an object, each of which may
require support from a collection of correct processes.

t-threshold: For any operation x.op, we say that x.op is t-threshold if x.op,
when executed by a correct process, eventually completes in any run ρ in which
n − t correct processes invoke x.op.

t-resilience: For any operation x.op, we say that x.op is t-resilient if x.op,
when executed by a correct process, eventually completes in any run ρ in which
each of at least n − t correct processes infinitely often has a pending invocation
of x.op.

An object is t-threshold (t-resilient) if all the operations it supports are
t-threshold (t-resilient). Notice that t-threshold implies t-resilience, but not vice
versa.



2.3 Object definitions 

Below we specify some of the objects used in this paper. 

Atomic registers: An atomic register x is an object with two operations: x.read
and x.write(v) where v ≠ ⊥. An x.read that occurs before the first x.write()
returns ⊥. An x.read that occurs after an x.write() returns the value written in
the last preceding x.write() operation. Throughout this paper we employ
wait-free atomic registers, i.e., x.read or x.write() operations by correct processes
eventually return (regardless of the behavior of other processes).

Sticky bits: A sticky bit x is an object with two operations: x.read and
x.write(v) where v ∈ {0, 1}. An x.read that occurs before the first x.write()
returns ⊥. An x.read that occurs after an x.write() returns the value written in
the first x.write() operation. We will be concerned with wait-free sticky bits.

Weak consensus objects: A weak (binary) consensus object x is an object with
one operation: x.propose(v), where v ∈ {0, 1}, satisfying: (1) The x.propose()
operation returns the same value, called the consensus value, to every process that
invokes it. (2) If the consensus value is v, then some process invoked x.propose(v).

Strong consensus objects: A strong (binary) consensus object x strengthens
the second condition above to read: (2) If the consensus value is v, then some
correct process invoked x.propose(v).

Observe that one sticky bit does not trivially implement a strong consensus
object, where each process first writes this bit and then reads it and decides
on the value returned. The first process to write the bit might be a faulty one,
violating the requirement that the consensus value must be proposed by some
correct process. (In Lemmas 2 and 3 we describe more complex implementations
of strong consensus from sticky bits.) Indeed, strong consensus objects do not
have sequential runs: the additional condition, using redundancy to mask
failures, requires at least t + 1 processes to invoke x.propose() before any correct
process returns from this operation. (In addition, Theorem 4 in Section 3.2 shows
that t-resilient strong consensus objects are ill-defined when t > n/3.)
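
The failure of the naive write-then-read protocol can be seen in a small sequential Python sketch. The `StickyBit` class and the `naive_propose` helper are illustrative names assumed for this example, and the Byzantine process is modeled simply as a writer that is scheduled first.

```python
class StickyBit:
    """First write wins; None stands in for the unset value ⊥."""

    def __init__(self):
        self._value = None

    def write(self, v):
        if v not in (0, 1):
            raise ValueError("a sticky bit holds 0 or 1")
        if self._value is None:
            self._value = v

    def read(self):
        return self._value


def naive_propose(bit, v):
    # Write the proposal, then decide on whatever the bit now holds.
    bit.write(v)
    return bit.read()


bit = StickyBit()
bit.write(1)  # a Byzantine process writes 1 before anyone else
# Every correct process proposes 0, yet all of them decide 1.
decisions = [naive_propose(bit, 0) for _ in range(3)]
```

The decisions agree, and the decided value was written by some process, so the weak condition holds; the strong condition fails because no correct process proposed 1.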

Throughout the paper, unless otherwise stated, by a consensus object we
mean a strong consensus object. Also, atomic registers and sticky bits are always
assumed to be wait-free.

3 A universal construction 

This section contains the main result of this paper, the construction of a universal
t-resilient object from wait-free sticky bits. That is, we show that sticky bits are
universal when the number of faults is small enough.

We assume any fault-tolerant object, o, is specified by two relations:

apply ⊆ INVOKE × STATE × STATE,

and reply ⊆ INVOKE × STATE × RESPONSE,

where INVOKE is the object’s domain of invocations, STATE is its domain 
of states (with a designated set of start states), and RESPONSE is its domain 
of responses. The apply relation denotes a nondeterministic state change based 
on the specific pending invocation and the current state (invocations do not 
block: we require a target state for every invocation and current state) , and the 
reply relation nondeterministically determines the calculated response, based 
on the pending invocation and the updated state.^ It is necessary to define 
two relations because in fault-tolerant objects (such as strong consensus), the 
response may depend on later invocations. The apply relation allows the state to 
be updated once the invocation occurs, without yet determining the response. 
The reply relation may only allow a response to be determined when other 
pending invocations update the state. 

For example, a t-threshold strong consensus object can be specified as
follows: STATE is the set of integer pairs (x, y), 0 ≤ x, y ≤ t, or the singletons 0
and 1, with (0, 0) as the single start state. For all integers x, y and u, v in {0, 1}
(constrained as shown), the apply relation is

{(propose(0), (x < t, y), (x + 1, y))} ∪ {(propose(1), (x, y < t), (x, y + 1))}
∪ {(propose(0), (t, y), 0)} ∪ {(propose(1), (x, t), 1)}
∪ {(propose(u), v ∈ {0, 1}, v)},

and the reply relation is {(propose(u), v ∈ {0, 1}, RETURN(v))}. Hence, each
invocation of a propose operation enables apply to increment the appropriate
counter in the state. Concurrent invocations introduce race conditions (as to
which application of apply occurs first). Once t + 1 applications of the same
value occur, the state is committed to that binary value, and the responses of
pending invocations are enabled.

^ This formulation generalizes Herlihy's specification of wait-free objects by a single
relation apply ⊆ INVOKE × STATE × STATE × RESPONSE, restricted (by the
wait-free condition) to have at least one target state and response defined for any pair
INVOKE × STATE [Her91]. This formulation is insufficient to define fault-tolerant
objects such as strong consensus.
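
The state machine in this example can be written out directly. The Python sketch below is one way to read the apply and reply relations for the strong consensus object, with nondeterminism already resolved; the function names `apply_propose` and `reply_propose` are assumptions made for the illustration, and `None` stands for "no response enabled yet".

```python
def apply_propose(v, state, t):
    """Next state after propose(v). A state is either a pair (x, y) of counters
    for the values 0 and 1, or a committed value 0 or 1."""
    if not isinstance(state, tuple):
        return state                 # already committed: the state no longer changes
    x, y = state
    if v == 0:
        return 0 if x == t else (x + 1, y)
    return 1 if y == t else (x, y + 1)


def reply_propose(v, state):
    """Response to a pending propose(v): defined only once the state is committed."""
    return state if not isinstance(state, tuple) else None


t = 2
state = (0, 0)
trace = []
for proposal in [0, 1, 0, 1, 1]:
    state = apply_propose(proposal, state, t)
    trace.append(state)
```

With t = 2 the state commits on the third propose(1), that is, after t + 1 applications of the same value, and only then does `reply_propose` return a value for any pending invocation.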

For the purposes of the universal construction below, we resolve any
nondeterminism, and assume that the first relation is a function from INVOKE ×
STATE to STATE, and that the second relation is a partial function from
INVOKE × STATE to RESPONSE. Given these restrictions, we may assume,
without loss of generality, that the object's domain of states is the set of strings
of invocations, and that the function from INVOKE × STATE to STATE
simply appends the pending invocation to the current state.

Theorem 1. Any t-resilient object can be implemented using:

1. (t + 1)-write(), n-read sticky bits and 1-write(), n-read sticky bits, provided
that n > (t + 1)(2t + 1); or

2. (2t + 1)-write(), (2t + 1)-read sticky bits and 1-write(), n-read sticky bits,
provided that n > (2t + 1)^2.

Figure 1 describes a universal implementation. In the lemmas, we provide two 
constructions of (strong) binary consensus objects using sticky bits, which differ 
in the access restrictions. 

Lemma 2. If n > (t + 1)(2t + 1), then an n-propose() t-threshold consensus
object can be implemented using (t + 1)-write(), n-read sticky bits.

Proof: Let o be the consensus object that is being implemented. Let
m = ⌊n/(t + 1)⌋. Partition the n processes into blocks B_1, ..., B_m, each of size
at least t + 1, and let x_1, ..., x_m be sticky bits with the property that the ACL
for x_i.write() is B_i (or a (t + 1)-subset thereof) and the ACL for x_i.read is
{p_1, ..., p_n}. For a correct process p ∈ B_i to emulate o.propose(v), it executes
x_i.write(v) (or skip if p is not in the ACL for x_i) and, once that completes,
repeatedly executes x_j.read for all 1 ≤ j ≤ m until none return ⊥. p chooses the
return value from o.propose(v) to be the value that is returned from the read
operations on a majority of the x_j's.^ All correct processes obtain the same
return value from their o.propose() emulations because the x_i's are sticky. If no
correct process emulates o.propose(v), then since m ≥ 2t + 1, v will not be
returned from the reads on a majority of the x_j's and thus will not be the
consensus value. Because each correct process reads x_j, 1 ≤ j ≤ m, until none
return ⊥, termination is guaranteed provided that each sticky bit is set. Since each
x_j has t + 1 processes proposing to it, it follows that o.propose() is guaranteed to
return when at least n − t perform propose() operations. □

^ In case m is even and the number of 1's equals the number of 0's, the majority value
is defined to be 1.
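
To make the block construction of Lemma 2 concrete, here is a sequential Python sketch. It is not a concurrent implementation: the schedule is fixed (Byzantine writes first, as a worst case for validity), process ids run from 0 to n − 1, and the names `build_blocks`, `majority` and `simulate` are assumptions made for this illustration.

```python
class StickyBit:
    """First write wins; None stands in for the unset value ⊥."""

    def __init__(self):
        self._value = None

    def write(self, v):
        if self._value is None:
            self._value = v

    def read(self):
        return self._value


def build_blocks(n, t):
    """Split process ids 0..n-1 into m = n // (t + 1) blocks of size >= t + 1."""
    m = n // (t + 1)
    blocks = [list(range(j * (t + 1), (j + 1) * (t + 1))) for j in range(m)]
    blocks[-1].extend(range(m * (t + 1), n))  # leftover processes join the last block
    return blocks


def majority(values):
    """Majority bit; a tie counts as 1, following the footnote's convention."""
    return 1 if 2 * sum(values) >= len(values) else 0


def simulate(n, t, byzantine_writes, correct_proposals):
    blocks = build_blocks(n, t)
    bits = [StickyBit() for _ in blocks]

    def write_own_bit(pid, v):
        for j, block in enumerate(blocks):
            if pid in block[: t + 1]:          # ACL: a (t + 1)-subset of the block
                bits[j].write(v)

    for pid, v in byzantine_writes.items():    # adversarial schedule: faulty first
        write_own_bit(pid, v)
    for pid, v in correct_proposals.items():
        write_own_bit(pid, v)

    values = [b.read() for b in bits]
    if None in values:
        return None                            # a correct process would keep re-reading
    return majority(values)


# n = 7, t = 1: process 0 is Byzantine and writes 1; all correct processes propose 0.
decision = simulate(7, 1, {0: 1}, {pid: 0 for pid in range(1, 7)})
```

Here the faulty process controls one of the m = 3 sticky bits, but the other two carry the value proposed by correct processes, so the majority is 0.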

Lemma 3. If n > (2t + 1)^2, then an n-propose() t-threshold consensus object
can be implemented using (2t + 1)-write(), (2t + 1)-read sticky bits and 1-write(),
n-read sticky bits.

Proof: Let o be the consensus object that is being implemented. Let r_1, ..., r_n
be 1-write(), n-read sticky bits such that the ACL for r_i.write() is {p_i}. Let
m = ⌊n/(2t + 1)⌋. Partition the n processes into blocks B_1, ..., B_m, each of size
at least 2t + 1, and let x_1, ..., x_m be sticky bits with the property that the ACLs
for x_i.write() and x_i.read are both B_i (or a (2t + 1)-subset thereof). For a correct
process p_j ∈ B_i to emulate o.propose(v), it executes x_i.write(v) (or skip if p_j is
not in the ACL for x_i) and, once that completes, it executes r_j ← x_i.read. p_j
then repeatedly reads the (single-writer) bits of all processes until, for each B_k,
it observes the same value V_k in the bits of t + 1 processes in B_k; note that V_k
must be the value returned by x_k.read (to a process allowed to execute x_k.read).
The value that occurs as t + 1 such V_k's is selected as the return value from
o.propose(v). Because x_i is sticky and B_i contains at most t faulty processes,
V_i is unique; thus, all correct processes obtain the same return value from their
o.propose() emulations. If no correct process emulates o.propose(v), then since
m ≥ 2t + 1, v cannot occur in the majority of the V_j's. □
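
For a feel of what the two size conditions allow, the following Python sketch computes the largest t admitted by each lemma for a given n. It takes the conditions as the strict inequalities n > (t + 1)(2t + 1) and n > (2t + 1)^2; the helper names are assumptions made for the example.

```python
def max_faults(n, size_needed):
    """Largest t >= 0 with n > size_needed(t), or None if even t = 0 fails."""
    best = None
    candidate = 0
    while n > size_needed(candidate):
        best = candidate
        candidate += 1
    return best


def lemma2_size(t):
    return (t + 1) * (2 * t + 1)


def lemma3_size(t):
    return (2 * t + 1) ** 2


# For n = 16: 16 > 3 * 5 = 15 but not 4 * 7 = 28; 16 > 3^2 = 9 but not 5^2 = 25.
t_lemma2 = max_faults(16, lemma2_size)
t_lemma3 = max_faults(16, lemma3_size)
```

Both bounds grow like the square root of n; the first tolerates more faults at the cost of wider write access to each block's sticky bit.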

3.1 Proof of Theorem 1 

For simplicity, we initially describe a universal construction of objects for which 
the domain of invocations is finite. Subsequently, we explain how to modify the 
construction to implement objects with (countably) infinite invocation domains. 

The construction conceptually mimics Herlihy's construction showing that
consensus is universal for wait-free objects in the fail-stop model [Her91]. Due
to the possibility of arbitrarily faulty processes in our system model, however,
the construction below differs in significant ways.

The construction labors to ensure that operations by correct processes even- 
tually complete, and that each operation by Byzantine processes either has no 
impact, or appears as (the same) valid operation to the correct processes. There 
are two principal data structures: 

1. For each process p_i there is an unbounded array Announce[i][1...], each
element of which is a "cell", where a cell is an array of ⌈log(|INVOKE|)⌉ sticky
bits. The Announce[i][j] cell describes the j-th invocation (operation name
and arguments) by p_i on o. Accordingly, the ACL for the write() operation
of each sticky bit in each cell of Announce[i] names p_i.

2. The object itself is represented as an unbounded array Sequence[1...] of
process-id's, where each Sequence[k] is a ⌈log(n)⌉ string of t-threshold, strong
binary consensus objects. We refer to the value represented by the string of
bits in Sequence[k] simply as Sequence[k]. Intuitively, if Sequence[k] = i and
Sequence[1], ..., Sequence[k − 1] contains the value i in exactly j − 1
positions, then the k-th invocation on o is described by Announce[i][j]. In this
case, we say that Announce[i][j] has been threaded.
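
The threading rule for Sequence can be illustrated with a short Python sketch that, given the decided process ids in order, recovers which Announce cell each position refers to. The function name `threaded_cells` and the use of a plain list for Sequence are assumptions for the example.

```python
def threaded_cells(sequence):
    """Map Sequence[1], Sequence[2], ... (a list of process ids) to the
    Announce[i][j] cells they thread: position k holds id i, and j is one plus
    the number of earlier positions that also hold i."""
    seen = {}
    cells = []
    for pid in sequence:
        seen[pid] = seen.get(pid, 0) + 1
        cells.append((pid, seen[pid]))
    return cells


# Process 2 is threaded three times, so its first three announcements are applied.
order = threaded_cells([2, 1, 2, 2, 3])
```

Every process that reads the same prefix of Sequence derives the same list of cells, and hence applies the same invocations in the same order.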

type: ID: array of ⌈log(n)⌉ strong consensus objects
      CELL: array of ⌈log(|INVOKE|)⌉ sticky bits

global variables:
    Announce[1..n][1...], array of CELL
        for all 1 ≤ i ≤ n and all j, the elements of Announce[i][j] are writable by p_i
    Sequence[1...], infinite array of IDs, each accessible by all processes

variables private to process p_i:
    MyNextAnnounce, index of next vacant cell in Announce[i], initially 1
    NextAnnounce[1..n], for each 1 ≤ j ≤ n, index in Announce[j][]
        of next operation of p_j to be read by p_i, initially 1
    CurrentState ∈ STATE, p_i's view of the state of o, initially the initial state of o
    NextSeq, next position to be threaded in Sequence[] as seen by p_i, initially 1
    NameSuffix, ⌈log(n)⌉ bit string

o.op:

(1)  write, bit by bit, the invocation o.op.invoke of o.op
         into Announce[i][MyNextAnnounce]
(2)  MyNextAnnounce++
     ; Apply operations until o.op is applied and p_i can return.
     ; Each while loop iteration applies exactly one operation.
(3)  while ((NextAnnounce[i] < MyNextAnnounce) or
            ((NextAnnounce[i] ≥ MyNextAnnounce)
             and (reply(o.op.invoke, CurrentState) is not defined))) do
(4)      ℓ ← NextSeq (mod n)    ; Select preferred process to help.
(5)      NameSuffix ← emptystring
(6)      for k = 0 to ⌈log(n)⌉ do    ; Loop applies the operation one bit per iteration.
             ; Search for a valid process index to propose
(7)          while ((Announce[ℓ + 1][NextAnnounce[ℓ + 1]] is invalid)
                    or (NameSuffix is not a suffix of the bit encoding of ℓ + 1)) do
(8)              ℓ ← ℓ + 1 (mod n) od
             ; Propose the k'th bit (right to left) of ℓ + 1
(9)          prepend(NameSuffix, Sequence[NextSeq][k].propose((ℓ + 1) & (2^k)))
(10)     od    ; A new cell has been threaded by NameSuffix in Sequence[NextSeq]
(11)     CurrentState ←
             apply(Announce[NameSuffix][NextAnnounce[NameSuffix]], CurrentState)
(12)     NextAnnounce[NameSuffix]++
(13)     NextSeq++
(14) od
(15) return(reply(o.op.invoke, CurrentState))

Figure 1: Universal implementation of o.op at p_i.

The universal construction of object o is described in Figure 1 as the code
process p_i executes to implement an operation o.op, with invocation o.op.invoke.
In outline, the emulation works as follows: process p_i first announces its next
invocation, and then threads unthreaded, announced invocations onto the end
of Sequence. It continues until it sees that its own operation has been threaded,
and that enough additional invocations (if any) have been threaded, that it can
compute a response and return. To assure that each announced invocation is
eventually threaded, the correct processes first try to thread any announced,
unthreaded cell of process p_{ℓ+1} into entry Sequence[k], where ℓ = k (mod n).
(Once process p_{ℓ+1} announces an operation, at most n other operations can be
threaded before p_{ℓ+1}'s.)

In more detail, process p_i keeps track of the first index of Announce[i] that
is vacant in a variable denoted MyNextAnnounce, and first (line 1) writes the
invocation, bit by bit, into Announce[i][MyNextAnnounce], and (line 2)
increments MyNextAnnounce. To keep track of which cells it has seen threaded
(including its own), p_i keeps n counters in an array NextAnnounce[1..n], where
each NextAnnounce[j] is one plus the number of times p_i has read cells of p_j
in Sequence, and hence the index of Announce[j] where p_i looks to find the
next operation announced by p_j. Hence, having incremented MyNextAnnounce,
NextAnnounce[i] = MyNextAnnounce − 1 until the current operation of p_i has
been threaded.

This inequality is thus one disjunct (line 3) in the loop (lines 4-10) in which p_i
threads cells. Once p_i's cell is threaded (and NextAnnounce[i] = MyNextAnnounce),
the next conjunct (again line 3) keeps p_i threading cells until a response to the
threaded operation can be computed, at which time it exits the loop and
returns the associated value (line 15). Notice that in some cases, this may require
any finite number of additional operations to be threaded after o.op, but by the
t-resilient condition, as long as operations of correct processes are eventually
threaded, eventually o.op can return. For example, if o.op is the propose()
operation of a strong consensus object, then it can return once at least t + 1 propose()
invocations with identical values occur. Process p_i keeps an index NextSeq which
points to the next entry in Sequence[1...] whose cells it has not yet accessed.

To thread cells, process p_i proposes (line 9) the binary encoding of a process
id, ℓ + 1, bit by bit, to Sequence[NextSeq]. In choosing ℓ, process p_i first
checks (first disjunct, line 7) that Announce[ℓ + 1][NextAnnounce[ℓ + 1]] contains
a valid encoding of an operation invocation. (And, as discussed above, p_i gives
preference (line 4) to a different process for each cell in Sequence.)

Starting (line 5) with the empty string, p_i accumulates (line 9) the bit-by-bit
encoding of the id being recorded in Sequence[NextSeq] into a local variable,
NameSuffix. If a bit being proposed by p_i is not the result returned (second
disjunct, line 7), then p_i searches (line 8) for another process to help, whose id
matches the bits accumulated in NameSuffix. (The properties of strong consensus
assure that such a process exists.)

Once process pi accumulates all the bits of the threaded cell into NameSuffix 
(the termination condition (line 6) of the for loop (lines 7-10)), it can update 
(line 11) its view of the object’s state with this invocation, and increment its 
records of (line 12) process NameSuffix’s successfully threaded cells and (line 13) 
the next unread cell in Sequence. Having successfully threaded a cell, pi returns 
to the top of the while loop (line 3). 

The sequencing and correct semantics of each operation follow trivially from
the sequential ordering of invocations in Sequence and the application of the
apply and reply functions. The proper termination of all correct operations follows,
as argued above, from the t-threshold property of the embedded consensus objects.

The construction and this argument address objects with finite domains of
invocation. We next briefly outline the modifications necessary to accommodate
objects with (countably) infinite domains of invocation. The quandary here is
that the representations of invocations using sticky bits are unbounded. Suppose
we naively change the type CELL to an (unbounded) sequence of sticky bits.

When process p_i attempts to read (line 7) an invocation in
Announce[ℓ + 1][NextAnnounce[ℓ + 1]], a faulty process might cause p_i to read
forever, by itself writing forever, in such a way that each finite prefix is a valid but
incomplete encoding of an invocation. (For any encoding, such a sequence exists by
König's lemma.) This problem can be avoided by interleaving reads of the bits of each
entry in Announce[ℓ + 1][NextAnnounce[1..n]], starting as before with the next bit
of NextAnnounce[ℓ + 1], until one of the accumulated strings validly encodes an
invocation. Details of the bookkeeping required, and the argument that correct
invocations are eventually threaded, are left to the reader. (Though note that
the number of invocations that may be threaded before a correct process's
announcement is now dependent on the relative lengths of different encodings.) □

3.2 Resilience and impossibility 

The proof of Theorem 1 presents a universal construction of t-resilient objects,
where t < (√n − 1)/2. Naturally, one would like to know whether there are more
fault-tolerant universal constructions, and in the limit, whether wait-free
universal constructions exist. Focusing on improving the bound t < (√n − 1)/2
in Theorem 1, that is, finding a universal construction or impossibility proofs
for t > (√n − 1)/2, we note that the construction in Figure 1 builds modularly on
t-resilient strong consensus. The t < (√n − 1)/2 bound of Theorem 1 follows
from the constructions of strong consensus from sticky bits, in Lemmas 2 and 3.
Constructions of strong consensus from sticky bits for larger values of t would
imply a more resilient universality result. The theorem below demonstrates that
such a search is bounded by t < n/3.

Theorem 4. For t > n/3, there is no t-resilient n-propose() (strong) consensus
object.

Proof. Assume to the contrary that there exists such an object. Let P_0 and P_1
be two sets of processes such that for each P_i (where i ∈ {0, 1}) the size of P_i is
⌈n/3⌉ and all processes in P_i propose the value i (i.e., have input i). Run these
two groups as if all the 2⌈n/3⌉ processes are correct until they all commit to a
consensus value. Without loss of generality, let this value be 0. Next, we let all
the remaining processes propose 1 and run until all commit to 0. We can now
assume that all the processes in P_0 are faulty and reach a contradiction. □

We point out that it is easy to define objects that are universal for any 
number of faults. An example is the append-queue object, which supports two 
operations. The first appends a value onto the queue, and the second reads the 
entire contents of the queue. By directly appending invocations onto the queue, 
the entire history of the object can be read. 
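
A minimal Python sketch of the append-queue idea follows. The class name `AppendQueue`, the helper `current_state`, and the counter object used as the emulated type are all assumptions for illustration; in particular, the sketch replays whatever has been appended and does not model how invocations from faulty processes would be validated.

```python
class AppendQueue:
    """Object with two operations: append a value, and read the entire contents."""

    def __init__(self):
        self._items = []

    def append(self, value):
        self._items.append(value)

    def read(self):
        return tuple(self._items)  # a snapshot of the whole history


def current_state(queue, apply, initial_state):
    """Replay the full history through a sequential apply function."""
    state = initial_state
    for invocation in queue.read():
        state = apply(invocation, state)
    return state


def counter_apply(invocation, state):
    # Hypothetical emulated object: a counter with "inc" and "reset" invocations.
    return state + 1 if invocation == "inc" else 0


q = AppendQueue()
for inv in ["inc", "inc", "reset", "inc"]:
    q.append(inv)
value = current_state(q, counter_apply, 0)
```

Because the queue never forgets an appended invocation, every reader reconstructs the same history, which is what makes the object universal for any number of faults at the cost of unbounded space.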

4 Ephemeral objects

In this section, we explore the power of ephemeral objects. We prove an im- 
possibility result for a class of erasable objects, and give several fault-tolerant 
constructions. 

Erasable objects: An erasable object is an object in which each pair of
operations op_1 and op_2, when invoked by different processes, either (1) commute
(such as a read and any other operation) or (2) for every pair of states s_1
and s_2, have invocations invoke_1 and invoke_2 such that apply(invoke_1, s_1) =
apply(invoke_2, s_2). Such familiar objects as registers, test&set, swap, and
read-modify-write are erasable. (This definition generalizes the notion of
commutative and overwriting operations [Her91].)
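
Condition (2) can be checked mechanically on small state spaces. In the Python sketch below, an invocation is modeled as a function from state to state; the names `register_write`, `sticky_write` and `can_equalize` are assumptions made for the example, and `None` stands in for ⊥.

```python
from itertools import product


def register_write(v):
    # Overwrites whatever was there.
    return lambda state: v


def sticky_write(v):
    # Only the first write takes effect.
    return lambda state: v if state is None else state


def can_equalize(invocations, states):
    """Condition (2): for every pair of states (s1, s2) there are invocations
    f and g with f(s1) == g(s2), so two histories can be made indistinguishable."""
    return all(
        any(f(s1) == g(s2) for f, g in product(invocations, repeat=2))
        for s1, s2 in product(states, repeat=2)
    )


states = [None, 0, 1]
register_ok = can_equalize([register_write(0), register_write(1)], states)
sticky_ok = can_equalize([sticky_write(0), sticky_write(1)], states)
```

A register write erases the difference between any two states, whereas no write can reconcile a sticky bit already set to 0 with one already set to 1, which is why sticky bits fall outside this class.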

Theorem 5. For any t > n/2, there is no implementation of a t-resilient
n-propose() weak consensus object using any set of erasable objects.

Proof. Assume to the contrary that such an implementation, called A, is possible.
We divide the n processes into three disjoint groups: P_0 and P_1, each of size at
least ⌊(n − 1)/2⌋, and a singleton which includes process p. Consider the following
finite runs of algorithm A:

1. ρ_0 is a run in which only processes in P_0 participate with input 0 and halt
once they have decided. They must all decide on 0. Let O_0 be the (finite) set
of objects that were accessed in this run, and let s_i^0 be the state of object o_i
at the end of this run.

2. ρ_1 is a run in which only processes in P_1 participate with input 1 and halt
once they have decided. They must all decide on 1. Let O_1 be the (finite) set
of objects that were accessed in this run, and let s_i^1 be the state of object o_i
at the end of this run.

3. ρ'_0 is a run in which processes from P_0 are correct and start with input 0,
and processes from P_1 are faulty and start with input 1. It is constructed as
follows. First, the processes from P_0 run exactly as in ρ_0 until they all decide
on 0. Then, the processes from P_1 set all the shared objects in (O_1 − O_0) to
the values that these objects have at (the end of) ρ_1, and set the values of
the objects in (O_1 ∩ O_0) to hide the order of previous accesses. That is, for
objects in which all operations accessible by P_0 and P_1 commute, P_1 runs
the same operations as in run ρ_1. For each remaining object o_i, P_0 invokes
an operation invoke_0 such that P_1 has access to an operation invoke_1 where
apply(invoke_0, s_i^1) = apply(invoke_1, s_i^0).

4. ρ'_1 is a run in which processes from P_1 are correct and start with input 1,
and processes from P_0 are faulty and start with input 0. It is constructed
symmetrically to ρ'_0: First, the processes from P_1 run exactly as in ρ_1 until
they all decide on 1. Then, as above, the processes from P_0 set all the shared
objects in (O_0 − O_1) to the values that these objects have at (the end of)
ρ_0. For objects in which all operations accessible by P_0 and P_1 commute,
P_0 runs the same operations as in run ρ_0. For each remaining object o_i, P_0
invokes the operation invoke_0 defined in ρ'_0.
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By construction, every object is in the same state after pQ and p[. But if we 
activate process p alone at the end of /?q, it cannot yet decide, because it would 
decide the same value if we activate process p alone at the end of p^ . So p must 
wait for help from the correct processes (which the f-resilience condition allows 
it to do) to disambiguate these identical states. 

Having allowed p to take some (ineffectual) steps, we can repeat the
construction again, scheduling $P_0$ and $P_1$ to take additional steps in each run, but
bringing the two runs again to identical states. By repeating this indefinitely, we
create two infinite runs, in each of which the correct processes, including p, take 
an infinite number of steps, but in which p never decides, a contradiction. □ 

4.1 Atomic registers 

Next we provide some examples of implementations using (ephemeral) atomic 
registers. The first such object is t-resilient k-set consensus [Cha93].

k-set consensus objects: A k-set consensus object x is an object with one
operation: x.propose(v) where v is some number. The x.propose() operation
returns a value such that (1) each value returned is proposed by some process,
and (2) the set of values returned is of size at most k.
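
The two conditions above can be stated as a small check on a finished run. The following Python sketch is only an illustration: the function name `satisfies_k_set_consensus` and its arguments are hypothetical, and it tests recorded results against the specification rather than implementing the object.

```python
def satisfies_k_set_consensus(proposed, returned, k):
    """Check the two k-set consensus conditions on a finished run.

    proposed: iterable of values passed to propose() by any process
    returned: iterable of values returned by propose() operations
    k:        bound on the number of distinct returned values
    """
    proposed = set(proposed)
    distinct_returned = set(returned)
    validity = distinct_returned <= proposed     # condition (1)
    agreement = len(distinct_returned) <= k      # condition (2)
    return validity and agreement
```

For instance, with proposals 1, 2, 3 and k = 2, the results 1, 2, 1 pass the check, while the results 1, 2, 3 violate condition (2).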

Theorem 6. For any t < n/3, if t < k then there is an implementation of a
t-resilient n-propose() k-set consensus object using atomic registers.

Proof. Processes $p_1$ through $p_{t+1}$ announce their input value by writing it into
a register announce[i], whose value is initially $\bot$. Each process repeatedly reads
the announce[1..t+1] registers, and echoes the first non-$\bot$ value it sees in any
announce[j] entry by copying it into a 1-writer register echo[i, j]. Interleaved
with this process, $p_i$ also reads all the echo[1..n, 1..t+1] registers, and returns
the value it first finds echoed the super-majority of 2n/3 + 1 times in some
column echo[1..n, k]. In subsequent operations, it returns the same value, but first
examines the announce[1..t+1] array and writes any new values to echo[i, 1..t+1].

Using this construction, no process can have two values for which a
super-majority of echoes are ever read. Moreover, any correct process among $p_1$ through
$p_{t+1}$ will eventually have its value echoed by a super-majority. Hence, every
operation by a correct process will eventually return one of at most t + 1 different
values. □

The implementation above of k-set consensus constructs a t-resilient object.
The next result shows that registers can be used to implement the stronger
t-threshold condition. (The proof is omitted from this extended abstract.)

k-pairwise set-consensus objects: A k-pairwise set-consensus object x is an
object with one operation: x.propose(v) where v is some number. The x.propose()
operation returns a set of at most k values such that (1) each value in the set
returned is proposed by some process, and (2) the intersection of any two sets
returned is non-empty.
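
As with k-set consensus, the definition can be read as a predicate on the sets returned in a finished run. The sketch below is an illustration under that reading; `satisfies_pairwise_set_consensus` is a hypothetical name, and the function checks the specification only, it does not implement the object.

```python
from itertools import combinations


def satisfies_pairwise_set_consensus(proposed, returned_sets, k):
    """Check the k-pairwise set-consensus conditions on a finished run.

    proposed:      iterable of values passed to propose()
    returned_sets: iterable of the sets returned by propose() operations
    k:             bound on the size of each returned set
    """
    proposed = set(proposed)
    sets = [set(s) for s in returned_sets]
    size_ok = all(len(s) <= k for s in sets)
    validity = all(s <= proposed for s in sets)                  # condition (1)
    intersecting = all(a & b for a, b in combinations(sets, 2))  # condition (2)
    return size_ok and validity and intersecting
```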

Theorem 7. For any t < n/3, there is an implementation of a t-threshold
n-propose() (2t + 1)-pairwise set-consensus object using atomic registers.

4.2 Fault-tolerant constructions using objects other than registers

Even in the presence of only one crash failure, it is not possible to implement 
election objects [TM96, MW87] or consensus objects [LA87, FLP85] using only 
atomic registers. Next we show that many other familiar objects, such as
2-process weak consensus, test&set, swap, compare&swap, and read-modify-write,
can be used to implement election objects for any number of processes and under 
any number of Byzantine faults. 

Election objects: An election object x is an object with one operation: x.elect().
The x.elect() operation returns a value, either 0 or 1, such that at most one
correct process returns 1, and if only correct processes participate then exactly
one process gets 1 (that process is called the leader). Notice that it is not
required for all the processes to “know” the identity of the leader. We have the
following result. (Proof omitted from this extended abstract.)

Theorem 8. There is an implementation of (1) n-threshold n-elect() election
from two-process versions of weak consensus, test&set, swap, compare&swap,
or read-modify-write, and (2) 2-threshold 2-propose() weak consensus from
2-elect() election.

5 Discussion 

The main positive result in this paper shows that there is a t-resilient universal
construction out of wait-free sticky bits, in a Byzantine shared memory
environment, when the number of failures t is limited. This leaves open the specific
questions of whether it is possible to weaken the wait-freedom assumption
(assuming sticky bits which are t-threshold or t-resilient) and/or to implement a
t-threshold object (instead of a t-resilient one).

We have also presented several impossibility and positive results for imple- 
menting fault-tolerant objects. There are further natural questions concerning 
the power of objects in this environment, such as: Is the resilience bound in our 
universality construction tight for sticky bits? What is the resilience bound for 
universality using other types of objects? What type of objects can be
implemented by others? The few observations regarding these questions in Sections 3.2
and 4 only begin to explore these questions.

Abstract. This paper considers a number of communication problems,
including end-to-end communication and multicasts, in networks whose
underlying graphs are directed and acyclic and whose links are subject
to permanent failures. In the case that each processor has separate input
queues for each in-edge, we present protocols for these problems that use
single bit headers. In the case that each processor has a single input queue
for all of its in-edges, we prove that $\Theta(\log d)$-bit headers are necessary
and sufficient, where d is the indegree of the graph.



1 Introduction 

The end-to-end communication problem is to send a sequence of messages from
a sender to a receiver through an unreliable network. It is a fundamental and 
well studied problem [3-6,8,9,11,13-15], whose solution allows distributed al- 
gorithms to treat unreliable communication networks as if they were reliable 
channels. 

In this paper, we consider the situation in which intermediate vertices are 
memoryless: they store no information about the state of the communication 
between the sender and the receiver. When an intermediate vertex receives a 
packet, it bases its actions on the contents of the packet header. The rest of the 
packet is assumed to have no effect on a protocol. This is to enforce a clear sep- 
aration between the protocol and the application programs using this protocol. 
Memoryless protocols are particularly relevant for public networks such as the 
Internet. They also provide a good starting point for more general theoretical 
investigations. 

A very simple memoryless protocol for end-to-end communication in a di- 
rected acyclic graph (DAG) is for the sender to flood the network with each 
message it wishes to send. Whenever an intermediate process receives a packet 
along one of its incoming links, it sends a copy of that packet along all of its 
outgoing links. To allow the receiver to ignore duplicate copies of a message, 
each packet header contains a sequence number (the index of the message in 
the body of the packet among the sequence of messages the sender wishes to 
transmit). This protocol handles links that can duplicate and reorder messages. 

M. Herlihy (Ed.): DISC 2000, LNCS 1914, pp. 360-373, 2000.

It can also tolerate permanent link failures, provided that at least one path from
the sender to the receiver remains operational. With acknowledgements (sent in 
the reverse direction through the DAG), message losses (which are equivalent to 
temporary link failures) can also be handled. The idea is that the sender repeat- 
edly sends a message until it receives an acknowledgement for that message from 
the receiver. Similarly, when the receiver receives a new message, it repeatedly 
sends acknowledgements for that message until it receives the next message in 
the sequence [7, 17, 19]. 
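
The flood protocol can be sketched in a few lines. The following Python simulation is a minimal illustration, not the protocol of any cited paper: the function `flood`, its arguments, and the queue-based scheduling are assumptions made for the example, and acknowledgements are left out because the simulated links only fail permanently.

```python
from collections import deque


def flood(edges, failed, sender, receiver, messages):
    """Flood each message through a DAG and return what the receiver reports.

    edges:    list of directed edges (u, v) of the DAG
    failed:   set of edges that have permanently failed
    messages: list of messages the sender wishes to transmit
    """
    out = {}
    for u, v in edges:
        if (u, v) not in failed:
            out.setdefault(u, []).append(v)

    received = {}  # sequence number -> message, kept by the receiver
    # Each packet is (sequence number, message); the queue holds (vertex, packet).
    queue = deque((sender, (seq, m)) for seq, m in enumerate(messages))
    while queue:
        vertex, packet = queue.popleft()
        if vertex == receiver:
            seq, m = packet
            received.setdefault(seq, m)  # duplicate copies are ignored
            continue
        for w in out.get(vertex, []):    # memoryless: copy onto every out-edge
            queue.append((w, packet))

    # Report the messages in sequence-number order, stopping at the first gap.
    reported = []
    while len(reported) in received:
        reported.append(received[len(reported)])
    return reported
```

The sequence number in each header is what lets the receiver discard duplicates, and it is also the part of the header that grows without bound.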

A bad feature of the flood protocol is that the sizes of the packet headers
increase without bound as the number of messages that the sender is transmitting
increases. Moreover, if the network is not a DAG, this protocol can generate an
infinite amount of traffic, even if the sender only wants to transmit a single
message. In general networks, this problem can be avoided by adding a $\lceil \log_2 n \rceil$-bit
hop counter to packet headers, where n is the number of processors in the
network [16].

Even in the simplest network consisting of two processors and a single link, 
unbounded sequence numbers are necessary if both packet reordering and
duplication can occur [20]. In contrast, the alternating bit protocol [10, 18], which uses
one bit headers (specifically, the least significant bit of the sequence number),
solves the end-to-end communication problem in this network, provided links 
cannot reorder packets. However, since intermediate vertices are memoryless, 
the alternating bit protocol cannot be directly used on links other than direct 
connections from S to R. 

For general networks in which only permanent link failures can occur, Dolev 
and Welch [11] have a protocol that uses bounded headers: if there are p
simple paths from the sender to the receiver, then $O(\log p)$ bit headers suffice. By
combining their protocol with the alternating bit protocol performed separately 
along each simple path between the sender and the receiver, they can also handle 
packet duplication and loss. 

Adler and Fich [1] have proved lower bounds on the size of packet headers, 
as a function of the network topology, when message losses can occur. For many 
networks, including complete graphs, series-parallel graphs, and fixed degree
meshes, these lower bounds match the packet header lengths used by Dolev and 
Welch’s protocol to within a constant factor. An open question was whether 
Adler and Fich’s lower bounds could be extended to networks that exhibit only 
permanent link failures. 

Here, we give a negative answer to this question. In fact, we prove that, for 
directed acyclic graphs, single bit headers suffice if only permanent link failures
occur. We also extend our protocol to allow streams of messages to be sent from 
various senders to various receivers. 

These results cannot be extended to all graphs. In particular, for any graph
that contains the complete graph on k vertices as a minor (such as the $k^2$-input
butterfly or the $k \times k \times 2$ mesh), any memoryless protocol that ensures delivery
of a single message from the sender to the receiver using headers with fewer than
$\lceil \log_2 k \rceil - 3$ bits generates an infinite amount of message traffic [1]. Similarly,
for any graph that contains a $3 \times k$ mesh as a minor, $\Omega(\log \log k)$-bit headers
are required to send a single message from the sender to the receiver [2].

We also consider a variant of the model where processors are not told the in- 
edge along which each packet arrives. This might be the case when each vertex 
has a single input queue rather than a separate input queue for each of its 
in-edges. In this model, we prove that $\Theta(\log d)$-bit headers are necessary and
sufficient for end-to-end communication, where d is the indegree of the graph. 

In Section 2, we present a more detailed description of the model. This is 
followed, in Section 3, by our end-to-end protocol for DAGs that uses one bit 
headers. Variants of our protocol for other problems and related models appear 
in Sections 4 and 5. Our lower bound appears in Section 6. 



2 The Model 

We model the network by a directed acyclic graph, with source S and sink R.
Each vertex corresponds to a processor, with S corresponding to the sender and
R corresponding to the receiver. An edge (u, v) represents a direct
communication link from the processor corresponding to vertex u to the processor
corresponding to vertex v. Packets can only travel in the forward direction along
edges. Throughout the paper, we use n to denote the number of processors. 

At any point in time, a link is either operational or has failed. Once a link has 
failed, it remains so and delivers no packets that are sent along it. Operational 
links do not lose, duplicate, or reorder packets. However, they are asynchronous, 
delivering every packet within a finite but unbounded amount of time. Hence it 
is impossible to distinguish between a link which has failed and an operational 
link which is just very slow (and may contain a sequence of messages which 
it still has to deliver). We assume that there is always some directed path of 
operational links from the sender S to the receiver R; otherwise it is impossible 
to transmit any information from S to R. Processes are assumed to be reliable, 
although a process which has failed can be simulated by considering all of its 
out-edges to have failed. 

From time to time, the sender S is given a message by an external applica- 
tion which it must send to the receiver R. Eventually, R must report the exact 
sequence of messages given to S. 

When S is given a message, it creates packets containing the message and 
sends them to its neighbours. The packet headers and which packets to send 
along each out-edge can be based on its current state (but not the message 
contents). On receipt of a packet, a processor can forward the message in the 
packet by sending packets with the same or different headers along its out-edges. 
It can also send packets that only contain header information. The number of 
packets the processor sends along each out-edge, the headers of those packets, 
and whether or not a given packet contains a copy of the message can depend on 
the header of the packet and the edge along which the packet arrived. Because we 
are considering only memoryless algorithms, these decisions must be independent 
of the past history of the communication through the network.

Links may fail at any time and it is possible that only one directed S-R path
of operational edges will exist. Therefore, a processor that receives a packet with
a message must send at least one packet containing the message along each of
its out-edges.

3 A Protocol with One Bit Headers 

In this section, we present a protocol to send an arbitrarily long sequence of 
messages through a directed acyclic graph G = (V, E) from a source vertex S to
a sink vertex R. The protocol uses packets with single bit headers. 

Let d be the maximum indegree of any vertex and let $f : E \to \{0, \ldots, d-1\}$
be any function that numbers the edges into each vertex. More specifically, if
$e = (u, v)$ then $f(e) < \mathrm{indegree}(v)$ and, if $e' = (u', v)$ where $u \ne u'$, then
$f(e) \ne f(e')$.

Consider the following protocol: 

- When the sender S is given a message m to send to the receiver R, it sends
a packet with message m and an arbitrary header along all its out-edges.

- When an intermediate vertex v receives a packet p along an in-edge e, then
along each of its out-edges, v sends the sequence of $\ell = \lceil \log_2 \mathrm{indegree}(v) \rceil$
packets with headers $b_1, \ldots, b_\ell$ and no messages, immediately followed by
the packet p, where $b_1 \cdots b_\ell$ is the $\ell$-bit binary representation of $f(e)$. In
particular, if v has indegree 1, then v just forwards all the packets it receives
to each of its out-edges.

- The receiver R records the sequence of packets it receives along each in-edge. 
From this information, it determines which messages to report. 
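
The rule for intermediate vertices can be sketched as a pure function of the arriving packet and the in-edge it arrived on, which is what makes the protocol memoryless. In this Python sketch the name `forward`, the representation of a packet as a `(header_bit, message)` pair with `None` for "no message", and the most-significant-bit-first order are assumptions made for illustration.

```python
def forward(packet, edge_index, indegree):
    """Packets an intermediate vertex sends on each out-edge for one arrival.

    packet:     (header_bit, message) pair that arrived; message may be None
    edge_index: f(e), the number assigned to the in-edge e of arrival
    indegree:   indegree of the vertex (at least 1)
    """
    # (indegree - 1).bit_length() equals ceil(log2(indegree)) for indegree >= 1.
    ell = (indegree - 1).bit_length()
    # ell header-only packets spell out f(e) in binary, most significant bit first.
    bits = [(edge_index >> shift) & 1 for shift in range(ell - 1, -1, -1)]
    return [(b, None) for b in bits] + [packet]
```

A vertex of indegree 1 gets `ell = 0`, so the function returns the packet unchanged, matching the special case noted in the protocol.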

For example, consider the following network. 




Fig. 1. A Directed Acyclic Network 



If the sender S is given only one message m and none of the links fail, then there
is an execution in which:

- S and u send the packet [0, m] along all their out-edges,

- v sends the sequence of packets [0, -], [0, m], [1, -], [0, m] along its two
out-edges, and

- w sends the sequence of packets [0, -], [1, -], [0, m], [1, -], [0, -], [0, -],
[1, -], [0, -], [0, m], [1, -], [0, -], [1, -], [0, -], [0, -], [0, m], [1, -], [0, -], [0, m]
along its out-edge.



It remains to describe how R determines the sequence of messages to report 
from the information it records. We shall show that the messages R reports 
are in the same order that S sends them, R does not duplicate messages, and, 
provided there is an S-R path of operational edges, R reports all the messages 
that S sends. 

When an intermediate processor (i.e. a vertex other than S or R) receives a
packet p, we say that the packets it sends in response are directly caused by p.
The caused by relation is the reflexive transitive closure of the directly caused
by relation. In other words, a packet $p'$ is caused by a packet p if $p' = p$ or
$p'$ is directly caused by a packet which, in turn, is caused by p. A packet p
is said to have followed directed path $\pi = v_1, \ldots, v_k$ if there is a sequence
of packets $p_1, \ldots, p_{k-1}$, where $p_{k-1} = p$, $p_i$ travelled along edge $(v_i, v_{i+1})$, for
$i = 1, \ldots, k-1$, and packet $p_i$ is directly caused by packet $p_{i-1}$ for $1 < i \le k-1$.

For any directed path $\pi$ to R in the graph, let $s(\pi)$ denote the vertex at the
beginning of path $\pi$ and let

$$\ell(\pi) = \lceil \log_2 \max\{1, \mathrm{indegree}(s(\pi))\} \rceil.$$

We begin by presenting a scheme in which R stores a lot of information. This 
facilitates the proof of correctness. Then we show how R can do essentially the 
same thing while storing significantly less information.




Fig. 2. The Tree Used by R for the Network in Figure 1 



The receiver R uses a rooted d-ary tree to record and process the packets
it has received. The nodes of the tree are the directed paths to R that start at
vertices reachable from S (i.e. paths that are suffixes of directed S-R paths).
The empty path to R is the root of the tree. Node $\pi$ is the i'th child of node $\rho$
if $\pi$ is $\rho$ preceded by the i'th edge into $\rho$, i.e., more formally, if $\pi = (s(\pi), s(\rho))\rho$
and $f(s(\pi), s(\rho)) = i - 1$. Since S is a source, all directed S-R paths are leaves
in the tree. The tree for the network in Figure 1 is given in Figure 2. At each
node of the tree, except the root, a list of packets is kept.

- When R receives a packet p along its i'th in-edge, it appends the packet to
the end of the list at the i'th child of the root.

- Whenever R appends a packet p to the list at an internal node $\pi$ causing the
length of the list to become equal to a multiple of $\ell(\pi) + 1$, and $b_1, \ldots, b_{\ell(\pi)}$
are the headers of the previous $\ell(\pi)$ packets in the list, R also appends a
copy of the packet p to the list at the i'th child of node $\pi$, where $b_1 \cdots b_{\ell(\pi)}$
is the binary representation of $i - 1$. In particular, if $\pi$ has only one child,
then R immediately copies each packet in $\pi$'s list to its child's list.

- Whenever R appends a packet to the list at an S-R path, causing the length
of this list to be greater than the number of messages R has reported, R
reports the message part of this packet.
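
The three rules above can be sketched as a small recursive procedure. The Python class below is an illustration under stated assumptions: the class `Receiver`, the dictionary `in_edges` (listing each vertex's in-neighbours in the order given by f), the representation of tree nodes as tuples of vertices, and packets as `(header_bit, message)` pairs are all choices made for this example rather than anything fixed by the protocol.

```python
class Receiver:
    def __init__(self, in_edges, sink="R"):
        # in_edges[v][j] is the tail of the in-edge e of v with f(e) = j.
        self.in_edges = in_edges
        self.sink = sink
        self.lists = {}      # tree node (path ending at the sink) -> list of packets
        self.reported = []   # messages reported so far, in order

    def receive(self, in_neighbour, packet):
        """Rule 1: record a packet arriving at the sink from in_neighbour."""
        self._append((in_neighbour, self.sink), packet)

    def _append(self, path, packet):
        lst = self.lists.setdefault(path, [])
        lst.append(packet)
        preds = self.in_edges.get(path[0], [])
        if not preds:
            # Rule 3: the path starts at a source, so this node is a leaf.
            if len(lst) > len(self.reported):
                self.reported.append(packet[1])
            return
        ell = (len(preds) - 1).bit_length()   # ceil(log2 indegree(path[0]))
        if len(lst) % (ell + 1) == 0:
            # Rule 2: the previous ell headers name the in-edge in binary.
            index = 0
            for header, _ in lst[len(lst) - 1 - ell:len(lst) - 1]:
                index = 2 * index + header
            self._append((preds[index],) + path, packet)
```

Feeding this sketch the packet sequences of the one-message execution described earlier, on a network whose in-neighbour lists are consistent with the node names used in that example, yields a single reported message.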

For the execution used in the example above, where S is given only one 
message m and none of the links fail, 

- the list of packets stored at node wR is [0, -], [1, -], [0, m], [1, -], [0, -], [0, -],
[1, -], [0, -], [0, m], [1, -], [0, -], [1, -], [0, -], [0, -], [0, m], [1, -], [0, -], [0, m],

- the list of packets stored at nodes vR and vwR is [0, -], [0, m], [1, -], [0, m],
and

- the list of packets stored at nodes SvR, uvR, SuvR, SwR, uwR, SuwR,
SvwR, uvwR, and SuvwR is [0, m].

For any nonempty directed path $\pi = v_1, \ldots, v_k$ from $v_1 = s(\pi)$ to $v_k = R$ in
the graph, let

$$L(\pi) = \prod_{1 \le i < k} \bigl(1 + \lceil \log_2 \max\{1, \mathrm{indegree}(v_i)\} \rceil\bigr).$$

If no links on path $\pi$ fail, each packet received by $s(\pi)$ causes $L(\pi)$ packets that
follow path $\pi$ to arrive at R and each packet sent by $s(\pi)$ causes

$$L(\pi)/(\ell(\pi)+1) = \prod_{1 < i < k} \bigl(1 + \lceil \log_2 \max\{1, \mathrm{indegree}(v_i)\} \rceil\bigr)$$

packets that follow path $\pi$ to arrive at R. This is because, whenever processor
$v_i$ receives a packet, it sends $1 + \lceil \log_2 \max\{1, \mathrm{indegree}(v_i)\} \rceil$ packets along
edge $(v_i, v_{i+1})$. For the example in Figure 1, L(vR) = L(SvR) = L(uvR) =
L(SuvR) = 2, L(wR) = L(SwR) = L(uwR) = L(SuwR) = 3, and L(vwR) =
L(SvwR) = L(uvwR) = L(SuvwR) = 6.
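
The product defining L can be computed directly from the indegrees along a path. In this Python sketch the function name `path_multiplier` and the `indegree` dictionary are assumptions made for illustration.

```python
def path_multiplier(path, indegree):
    """L(path): product of 1 + ceil(log2 max{1, indegree(v)}) over every
    vertex v on the path except the last one (the receiver)."""
    product = 1
    for v in path[:-1]:
        # (x - 1).bit_length() equals ceil(log2 x) for every integer x >= 1.
        ell = (max(1, indegree[v]) - 1).bit_length()
        product *= 1 + ell
    return product
```

With indegrees 0, 1, 2, 3 for S, u, v, w, this gives 2 for the path v, R, then 3 for w, R, and 6 for v, w, R, in agreement with the values listed for the example.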

The next result provides the key invariant satisfied by our protocol.

Lemma 1. If R has received n packets that followed path $\pi$, then the list of
packets stored at node $\pi$ is the length $\lfloor n(\ell(\pi) + 1)/L(\pi) \rfloor$ prefix of the sequence
of packets sent by processor $s(\pi)$.

Proof. The proof is by induction on the length of $\pi$. If $\pi$ consists of a single
edge into R, then $L(\pi) = \ell(\pi) + 1$. Since links do not reorder, duplicate, or lose
messages, the sequence of packets that R receives along this edge is a prefix
of the sequence of packets sent by processor $s(\pi)$. Processor R records all
$n = \lfloor n(\ell(\pi) + 1)/L(\pi) \rfloor$ packets it receives along this edge in the list at node $\pi$. Hence
the claim is true for $\pi$.

Suppose the claim is true for node $\pi$. Let $\pi_1, \ldots, \pi_k$ be the children of $\pi$ and
let $n_i$ denote the number of packets received by R that followed path $\pi_i$. Then
R has received $n = n_1 + \cdots + n_k$ packets that followed path $\pi$. Links do not
reorder, duplicate, or lose packets and, if no links of $\pi$ fail, each packet received
by $s(\pi)$ causes $L(\pi)$ packets to follow path $\pi$. These facts imply that $s(\pi)$ has
received at least $\lceil n/L(\pi) \rceil$ packets, the first $\lfloor n/L(\pi) \rfloor$ of which have each caused
$L(\pi)$ consecutive packets that followed path $\pi$. If n is not divisible by $L(\pi)$, then
the $\lceil n/L(\pi) \rceil$'th packet received by $s(\pi)$ has caused $n - L(\pi) \lfloor n/L(\pi) \rfloor < L(\pi)$
packets that followed path $\pi$. All remaining packets received by $s(\pi)$ have caused
no packets that followed path $\pi$.

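Of the first $\lfloor n/L(\pi) \rfloor$ packets received by $s(\pi)$, let $n'_i$ denote the number
received by $s(\pi)$ along its i'th in-edge, so $n'_1 + \cdots + n'_k = \lfloor n/L(\pi) \rfloor$. Each of
these $n'_i$ packets was sent by $s(\pi_i)$ and caused $L(\pi)$ packets at R that followed
path $\pi_i$. Therefore $n_i \ge n'_i L(\pi)$. It follows that

$$\lfloor n_1/L(\pi) \rfloor + \cdots + \lfloor n_k/L(\pi) \rfloor \ge n'_1 + \cdots + n'_k = \lfloor n/L(\pi) \rfloor = \lfloor (n_1 + \cdots + n_k)/L(\pi) \rfloor \ge \lfloor n_1/L(\pi) \rfloor + \cdots + \lfloor n_k/L(\pi) \rfloor,$$

so $n'_i = \lfloor n_i/L(\pi) \rfloor$, for $i = 1, \ldots, k$.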

By the induction hypothesis, the list of packets stored at $\pi$ is the length
$n' = \lfloor n(\ell(\pi) + 1)/L(\pi) \rfloor$ prefix of the sequence of packets sent by processor $s(\pi)$.
Processor $s(\pi)$ sends a block of $\ell(\pi) + 1$ consecutive packets along each of its
out-edges in response to each packet it receives. Thus, the $n'$ packets stored at
node $\pi$ are directly caused by the first $\lceil n'/(\ell(\pi) + 1) \rceil$ packets received by $s(\pi)$.
Furthermore, the first $\lfloor n'/(\ell(\pi) + 1) \rfloor = \lfloor n/L(\pi) \rfloor$ packets received by $s(\pi)$ each
directly caused a block of $\ell(\pi) + 1$ consecutive packets stored at node $\pi$. The
first $\ell(\pi)$ packets in each block indicate the edge on which the packet causing
this block of packets arrived at $s(\pi)$ and, hence, which processor $s(\pi_i)$ sent it.
The last packet in the block is a copy of this packet, which R copies to the
list stored at the corresponding path $\pi_i$. Therefore, for $i = 1, \ldots, k$, the list of
packets stored at node $\pi_i$ is the length $n'_i = \lfloor n_i/L(\pi) \rfloor = \lfloor n_i(\ell(\pi_i) + 1)/L(\pi_i) \rfloor$
prefix of the sequence of packets sent by processor $s(\pi_i)$. □
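
As a check of the lemma on the running example (a worked instance added for illustration, using the one-message execution described above and the values of L listed for Figure 1), consider the nodes vwR and SvwR. In that execution v sends 4 packets to w, each of which causes 3 packets to arrive at R, and S sends 1 packet to v, which causes 6 packets to arrive at R along the path S, v, w, R.

```latex
\begin{align*}
\pi = vwR:\quad  & n = 4 \cdot 3 = 12, \quad \ell(\pi) = 1, \quad L(\pi) = 6,
  \quad \left\lfloor \frac{n(\ell(\pi)+1)}{L(\pi)} \right\rfloor
  = \left\lfloor \frac{12 \cdot 2}{6} \right\rfloor = 4, \\
\pi = SvwR:\quad & n = 1 \cdot 6 = 6, \quad \ell(\pi) = 0, \quad L(\pi) = 6,
  \quad \left\lfloor \frac{n(\ell(\pi)+1)}{L(\pi)} \right\rfloor
  = \left\lfloor \frac{6 \cdot 1}{6} \right\rfloor = 1.
\end{align*}
```

Both values agree with the example: the list at vwR holds all 4 packets sent by v, and the list at SvwR holds the single packet [0, m] sent by S.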

Our main result follows directly from this lemma. 

Theorem 1. Consider a network whose underlying graph is a DAG and whose 
links are subject to permanent failure. There is a memoryless protocol for this 
network that uses one bit headers to perform end-to-end communication from a 
sender to a receiver.

Proof. From Lemma 1, the list at each leaf node is a prefix of the sequence of
packets sent by S. The t'th packet reported by R is the t'th packet from one
of these lists. Thus, the sequence of packets reported by R is a prefix of the
sequence of packets sent by S.

Now suppose that all the edges on some S-R path $\pi$ remain operational.
Then each packet given to S causes $L(\pi)$ packets to follow path $\pi$ to R. If S has
been given t packets, eventually R receives $tL(\pi)$ packets that followed path $\pi$.
By Lemma 1, the list at node $\pi$ will eventually have length t and, thus, the t'th
packet will be reported by R. □

At each internal node $\pi$, R uses the single bit headers of the $\ell(\pi)$ previous
packets plus a modulo $\ell(\pi) + 1$ counter. Thus, only $\ell(\pi) + \lceil \log_2(\ell(\pi) + 1) \rceil$ bits
are needed to represent the required information at this node. Similarly, R only
uses the length of the list stored at a leaf, hence a counter suffices in place of the 
list. In addition, R must keep track of the number of messages it has reported. 
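
The per-node storage bound can be evaluated with a short helper. This Python sketch is illustrative only; the name `bits_at_internal_node` is an assumption.

```python
def bits_at_internal_node(indegree_of_start):
    """Bits R keeps at an internal node pi whose first vertex has this indegree:
    ell(pi) header bits plus a counter modulo ell(pi) + 1."""
    ell = (max(1, indegree_of_start) - 1).bit_length()  # ceil(log2 indegree)
    counter_bits = ell.bit_length()                     # ceil(log2(ell + 1))
    return ell + counter_bits
```

For a node whose first vertex has indegree 3, such as wR in the example, this gives 2 + 2 = 4 bits.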

Notice that, on receipt of a packet, each processor sends the same sequence 
of packets on all of its out-edges. Thus the protocol will work unchanged if a 
broadcast is used in place of these sends. 

4 Variants of the Basic Protocol for Related Problems 

With simple changes, the basic protocol described in Section 3 can be adapted to 
solve more general communication problems on a network with a directed acyclic 
underlying graph G = (V, E). We consider both multiple senders and multiple
receivers and require all, one or some of the receivers to report a particular 
message. In all these cases, the resulting protocols use single bit headers. 

The simplest variant allows the sender to perform a sequence of broadcasts 
to a set of receivers. 

Theorem 2. Consider a network whose underlying graph is a DAG and whose
links are subject to permanent failure. There is a memoryless protocol for this
network that uses one bit headers to perform a sequence of broadcasts from a
sender to a set of receivers.

Proof. Let $\mathcal{R}$ denote the set of receivers. Each vertex $u \in \mathcal{R}$ behaves like the
single receiver R, as described in Section 3, using a tree data structure rooted
at the empty path to u. Those vertices in $\mathcal{R}$ that are not sinks also send packets
along their out-edges in response to receiving a packet. □

It is possible to extend this protocol to handle the situation where different 
messages can be sent to different receivers. 

Theorem 3. Consider a network whose underlying graph is a DAG and whose 
links are subject to permanent failure. There is a memoryless protocol for this 
network that uses one bit headers to perform end-to-end communication from a 
sender to a set of receivers.

Proof. Imagine that, for each receiver $u \in \mathcal{R}$, there is one source node $u' \notin V$
which has a single out-edge leading to S. To send a message to u, the sender
S pretends that it has received a packet from $u'$ containing the message. This
causes S to send $\lceil \log_2 |\mathcal{R}| \rceil$ packets, each consisting of a single header bit, which
together form the $\lceil \log_2 |\mathcal{R}| \rceil$-bit binary representation of the index of $(u', S)$
among the in-edges of S, followed by a packet containing the message.
In effect, the message is preceded by packets that
specify the receiver to whom it is directed. Each receiver $u \in \mathcal{R}$ only keeps track
of the leaves corresponding to $u'$-$u$ paths and reports a new message whenever
the maximum of the lengths of the lists at these leaves increases. □

Another problem is to handle a set of senders $\mathcal{S}$, each of which may have a
sequence of messages to transmit to the receiver R. Here R separately reports
the messages received from each processor in $\mathcal{S}$. A similar modification to the
basic protocol will handle this situation, provided that each processor in $\mathcal{S}$ has
at least one nonempty directed path to R.

Theorem 4. Consider a network whose underlying graph is a DAG and whose
links are subject to permanent failure. There is a memoryless protocol for this
network that uses one bit headers to perform end-to-end communication from a
set of senders to a receiver.

Proof. Imagine a source node $S \notin V$ that is connected to all the vertices in $\mathcal{S}$
and is the source of all messages. This increases the indegree of every vertex in
$\mathcal{S}$ by 1. Each vertex $v \in \mathcal{S}$ that wants to send a message to R behaves as if S had
just sent it a packet containing the message. Then, each packet R receives that
appears to have followed a path beginning with edge (S, v), actually originated
at vertex v.

Now R uses $|\mathcal{S}|$ counters, one to keep track of the number of messages
originating at each of the vertices in $\mathcal{S}$. Whenever R copies a packet to the list at a
leaf (i.e. at a node corresponding to a path $\pi$ from the imaginary source node S
to R), R compares the new length of this list to the number of messages it has
reported that originated from v, the second vertex on path $\pi$. If it is larger, then
R reports the message part of this packet to be the next message that originated
from v.

Note that, if all of the senders in $\mathcal{S}$ are source nodes in the network, the
imaginary source node S is not needed. In this case, the leaves in the tree are
all the directed paths to R from nodes in $\mathcal{S}$. □

The ideas in Theorems 3 and 4 can be combined to handle multiple senders 
and receivers. 

Theorem 5. Consider a network whose underlying graph is a DAG and whose 
links are subject to permanent failure. There is a memoryless protocol for this
network that uses one bit headers to perform end-to-end communication from a
set of senders to a set of receivers.

Proof. Let $\mathcal{R}$ denote the set of receivers and let $\mathcal{S}$ denote the set of senders. As
in the proof of Theorem 3, for each receiver $u \in \mathcal{R}$, there is an imaginary source
node $u' \notin V$ and the leaves in u's tree are all $u'$-$u$ paths. Each sender $v \in \mathcal{S}$
has an in-edge $(u', v)$ for each receiver $u \in \mathcal{R}$ with which it may communicate.
Whenever v has a message to send to $u \in \mathcal{R}$, it pretends that it has received a
packet from $u'$ containing the message.

As in the proof of Theorem 4, each receiver $u \in \mathcal{R}$ has one counter for
each sender $v \in \mathcal{S}$ that may communicate with it, to keep track of the number
of different personal messages it has received originating from v, i.e. following
paths that seem to begin with edge $(u', v)$. Whenever one of these counters is
incremented, u reports the message from the packet under consideration. □

A simple modification extends this protocol to handle multicasts, where 
senders may wish to broadcast messages to subsets of the receivers. 

Theorem 6. Consider a network whose underlying graph is a DAG and whose
links are subject to permanent failure. There is a memoryless protocol for this
network that uses one bit headers to perform a sequence of multicasts from a set
of senders to a set of receivers.

Proof. The idea is to have a separate imaginary source node for each desired
subset of receivers, with edges directed from it to each sender that may wish
to broadcast to this subset of receivers. Each leaf in u's tree is a path to u
originating from an imaginary node that represents a set which contains u. □

As was the case for the basic protocol described in Section 3, all of the 
protocols described in this section will work unchanged if broadcasts are used 
instead of sends. 

5 Variants of the Protocols for Other Models 

In some models, processors do not have built-in access to information about 
the source of each of their incoming packets. In other words, processors are 
not told on which of their in-edges each packet arrives. We call such models 
source oblivious. Source oblivious models may be appropriate when each vertex
has a single input queue, rather than a separate input queue for each of its in- 
edges. Any of the previous protocols can be modified to work in source oblivious 
models by using longer headers. For example, the following result is the analogue 
of Theorem 1. 

Theorem 7. Consider a network whose underlying graph is a DAG with
indegree d and links that are subject to permanent failure. There is a memoryless,
source oblivious protocol for this network that uses headers of length $1 + \lceil \log_2 d \rceil$
to perform end-to-end communication from a sender to a receiver.

Proof. The longer headers allow the protocol described in Section 3 to be
simulated in this weaker model. Specifically, when a processor sends a packet
along edge $e$ to processor $v$, it includes the
$\lceil \log_2 \mathrm{indegree}(v) \rceil$-bit binary representation of $l(e)$
as part of the header. When $v$ receives the packet, it strips off this
information and uses it to identify the in-edge along which the packet arrived.

□ 
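The header layout used in this proof can be sketched as follows. This is only an illustration, assuming that the in-edges of a processor are labelled 0, 1, ..., indegree - 1; the function names are hypothetical and do not come from the paper.

```python
def label_bits(indegree: int) -> int:
    """Bits needed to name one of `indegree` in-edges: ceil(log2(indegree))."""
    return (indegree - 1).bit_length()


def pack_header(protocol_bit: int, edge_label: int, indegree: int) -> str:
    """Sender side: the one-bit header of the basic protocol, followed by
    the binary label of the edge the packet is sent along (assumed layout)."""
    assert protocol_bit in (0, 1) and 0 <= edge_label < indegree
    width = label_bits(indegree)
    label = format(edge_label, "b").zfill(width) if width else ""
    return str(protocol_bit) + label


def unpack_header(header: str) -> tuple[int, int]:
    """Receiver side: strip the label and recover (protocol bit, in-edge)."""
    protocol_bit = int(header[0])
    edge_label = int(header[1:], 2) if len(header) > 1 else 0
    return protocol_bit, edge_label


# A processor with 5 in-edges needs 3 label bits, so headers have 4 bits.
example_header = pack_header(1, 2, 5)
```

The receiver never needs to be told the in-edge by the network: `unpack_header(example_header)` recovers both the original bit and the edge label from the packet itself.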

Notice that processors send different packets along their different out-edges. 
Therefore broadcasts do not suffice for this protocol. If each processor has a 
single input queue and can only broadcast packets to its out-neighbours, there 
is another variant of the protocol that can be used, instead. 

Theorem 8. Consider a network whose underlying graph $G = (V, E)$ is a DAG
with $n$ vertices, indegree $d$, outdegree $D$, and links that are subject to
permanent failure. There is a memoryless, source oblivious protocol for this
network that uses headers of length
$1 + \lceil \log_2 \min\{n, D(d-1)+1\} \rceil$ to perform end-to-end
communication from a sender to a receiver, where processors only broadcast
packets.

Proof. Consider the hypergraph $G' = (V, E')$ where

$E' = \{\text{in-neighbours}(v) \mid v \in V\}$.

The degree in $G'$ of vertex $v \in V$ is the number of nodes that share at
least one hyperedge with $v$, and
$\mathrm{degree}(G') = \max_{v \in V} \mathrm{degree}(v)$. Since the indegree
of $G$ is $d$ and its outdegree is $D$, it follows that
$\mathrm{degree}(G') < \min\{n, D(d-1)+1\}$.
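The degree bound can be checked directly. The derivation below is a sketch, writing $G'$ for the hypergraph of in-neighbourhoods and $D$ for the outdegree of $G$, and assuming that the degree of $v$ counts the nodes other than $v$ itself. A vertex $v$ lies in at most $D$ hyperedges (one per out-neighbour), and each hyperedge contains at most $d - 1$ nodes besides $v$:

```latex
\[
\mathrm{degree}(v) \;\le\;
\sum_{w \,:\, v \in \text{in-neighbours}(w)}
\bigl(|\text{in-neighbours}(w)| - 1\bigr)
\;\le\; D(d-1),
\qquad
\mathrm{degree}(v) \;\le\; n - 1,
\]
\[
\text{hence}\quad
\mathrm{degree}(G') \;\le\; \min\{\,n-1,\; D(d-1)\,\}
\;<\; \min\{\,n,\; D(d-1)+1\,\}.
\]
```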

A strong colouring is a function assigning each vertex a colour so that no
hyperedge contains two vertices of the same colour. There is a strong colouring
$g : V \to \{0, 1, \ldots, c-1\}$ of $G'$ using at most
$c = 1 + \mathrm{degree}(G')$ colours [12].

As in Theorem 7, the longer headers allow the protocol described in Section 3
to be simulated. When a processor $u$ wants to send a packet to its
out-neighbours, it includes the $\lceil \log_2 c \rceil$-bit colour $g(u)$ as
part of the header. Let $v$ be an out-neighbour of $u$. Since all of $v$'s
other in-neighbours are contained in the hyperedge in-neighbours($v$) together
with $u$, they all have a colour different from $g(u)$. Therefore, when
processor $v$ receives the packet, it can strip off the colour from the header
and use this information to identify its in-neighbour $u$ that sent the
packet. □
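A strong colouring of this kind can be computed greedily. The sketch below is not the construction of [12]; it is a plain greedy colouring, which uses at most one more colour than the maximum number of conflicting vertices. The graph, the function names and the vertex names are hypothetical.

```python
def strong_colouring(in_neighbours: dict[str, set[str]]) -> dict[str, int]:
    """Colour vertices so that no two in-neighbours of the same vertex
    (that is, no two vertices of one hyperedge) share a colour."""
    vertices = set(in_neighbours)
    for nbrs in in_neighbours.values():
        vertices |= nbrs
    # conflicts[u] = vertices that share at least one hyperedge with u
    conflicts = {u: set() for u in vertices}
    for nbrs in in_neighbours.values():
        for u in nbrs:
            conflicts[u] |= nbrs - {u}
    colour: dict[str, int] = {}
    for u in sorted(vertices):
        used = {colour[w] for w in conflicts[u] if w in colour}
        # At most len(conflicts[u]) colours are taken, so one is always free.
        colour[u] = next(c for c in range(len(conflicts[u]) + 1) if c not in used)
    return colour


def identify_sender(v: str, header_colour: int,
                    in_neighbours: dict[str, set[str]],
                    colour: dict[str, int]) -> str:
    """Receiver side: map the colour found in the header to the unique
    in-neighbour of v that carries it."""
    (sender,) = [u for u in in_neighbours[v] if colour[u] == header_colour]
    return sender


# Hypothetical DAG: s feeds a, b, c; v hears from a and b; w from b and c.
example_dag = {"s": set(), "a": {"s"}, "b": {"s"}, "c": {"s"},
               "v": {"a", "b"}, "w": {"b", "c"}}
example_colours = strong_colouring(example_dag)
```

In this example `a` and `c` may share a colour because no processor has both of them as in-neighbours, which is exactly why fewer colours than vertices can suffice.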

6 A Lower Bound 

In this section, we prove that headers of length $\lceil \log_2 d \rceil$ are
necessary for end-to-end communication in a DAG of indegree $d$ using source
oblivious protocols, i.e. when processors are not told the in-edge along which
each packet arrives. In other words, the behaviour of a processor on receipt
of a packet cannot directly depend on the in-edge on which the packet arrives,
but only on the packet header (and, for processor $R$, its state).

Our lower bound is for the class of data oblivious protocols, where processors
(including $S$ and $R$) do not perform actions based on the contents of the
messages being sent [1,6]. In particular, processors cannot compare messages or
perform computation on them. This is appropriate when one views end-to-end
communication protocols as providing a reliable communication layer that will
be used by many different distributed algorithms.

Also, it is necessary to assume that there is a bound on the number of packets
sent by $S$ when it is given a message to send to $R$. Otherwise, there are
counting protocols [3, 18] that use no headers. In these protocols, for each
successive message in the sequence it wants to send to the receiver $R$, the
sender $S$ broadcasts an exponentially increasing number of packets containing
that message. Moreover, $R$ performs equality tests on the packets.

The following lower bound shows that each processor must be able to dis- 
tinguish between the packets arriving on different in-edges, so that R can avoid 
confusing packets caused by different messages. 

Theorem 9. Consider any network whose underlying graph $G$ is a DAG with
a single source $S$, a single sink $R$, and indegree $d$. Then any memoryless,
source oblivious, data oblivious protocol for sending an arbitrarily long
sequence of messages from $S$ to $R$ in this network must use at least $d$
different headers (and hence have header length at least
$\lceil \log_2 d \rceil$).

Proof. To obtain a contradiction, suppose there is a protocol for end-to-end
communication from $S$ to $R$ in $G$ that is memoryless, source oblivious, data
oblivious, and uses fewer than $d$ different headers. Let $H$ be the set of
different packet headers used by the protocol.

Let $v$ be a vertex of $G$ with indegree $d$ and let $I$ denote the set of
edges into $v$. Let $\pi$ be a directed path from $v$ to $R$ and let $T$ be a
tree of directed edges connecting $S$ to all the in-neighbours of $v$. Suppose
that all edges in $\pi$, $T$, and $I$ are operational and all other edges have
failed. The edges in $\pi$ and $T$ will have delay 0, but the speed of the
edges into $v$ will be controlled by an adversary.

Suppose that $S$ is given an infinite sequence of different messages to send
to $R$. Consider the resulting infinite sequence of packets $P(e)$ that travel
along each edge $e \in I$. Let $H(e)$ denote the set of packet headers that
occur infinitely often in $P(e)$.

Consider the bipartite graph with vertex set $I \cup H$ and an edge between
$e \in I$ and $h \in H$ if and only if $h \in H(e)$. Let $M$ be any maximal
matching of this graph. Since $|I| > |H|$, there exists $e' \in I$ which is
unmatched. Since $M$ is maximal, each $h \in H(e')$ is matched; otherwise edge
$(e', h)$ could be added to the matching. Let $m(h) \in I$ denote the match of
$h$ for each $h \in H(e')$.
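The matching step can be illustrated on a small instance. The sketch below builds a maximal (not necessarily maximum) matching greedily; the edge and header names are hypothetical, and `headers_of[e]` plays the role of the set of headers occurring infinitely often on in-edge e.

```python
def maximal_matching(headers_of: dict[str, set[str]]) -> dict[str, str]:
    """Greedy maximal matching between in-edges and headers.
    Returns a map from each matched header h to its match m(h)."""
    match_of_header: dict[str, str] = {}
    for e in sorted(headers_of):
        for h in sorted(headers_of[e]):
            if h not in match_of_header:
                match_of_header[h] = e   # match edge e with header h
                break                    # each edge is matched at most once
    return match_of_header


def unmatched_edges(headers_of: dict[str, set[str]],
                    match_of_header: dict[str, str]) -> list[str]:
    """In-edges left without a partner in the matching."""
    matched = set(match_of_header.values())
    return sorted(e for e in headers_of if e not in matched)


# Three in-edges but only two headers, so some edge must stay unmatched.
example_headers = {"e1": {"h0"}, "e2": {"h0", "h1"}, "e3": {"h0", "h1"}}
example_match = maximal_matching(example_headers)
example_unmatched = unmatched_edges(example_headers, example_match)
```

Because the matching is maximal, every header that occurs on an unmatched edge is already matched to some other edge, which is the property the adversary argument relies on.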

Let $B$ be an upper bound on the number of packets in $P(e)$ caused by a
single message given to $S$, for all $e \in I$. This value must exist because
there is a bound on the number of packets sent by $S$ when it is given a
message, and the intermediate processors are memoryless, so each packet
received by an intermediate processor can directly cause only a finite number
of outgoing packets. Let $k$ be sufficiently large so that

- the prefix of $P(e')$ caused by the first $k$ messages contains all headers
  that do not occur in $P(e')$ infinitely often (i.e. all $h \in H - H(e')$),
  and

- for each $h \in H(e')$, the prefix of $P(m(h))$ caused by the first $k$
  messages contains at least $B$ occurrences of header $h$.

The adversary constructs two executions with the property that $R$ reports
the wrong sequence for at least one of them. For both executions, the adversary
begins by giving $S$ the first $k$ messages. It allows all packets on edge $e'$
to travel with delay 0, but makes all other edges $e \in I$ so slow that not
one packet has yet arrived at $v$ along any of them.

Then the adversary gives message $k + 1$ to $S$. Suppose this message causes
a sequence of packets on edge $e'$ with headers $h_1, \ldots, h_b$, where
$b \le B$. Before delivering the next packet to $v$ along edge $e'$, the
adversary delivers packets to $v$ from $P(m(h_1))$ (i.e. along edge $m(h_1)$)
up to but not including the first occurrence of $h_1$. In the first execution,
the adversary delivers the packet with header $h_1$ along edge $e'$ followed by
the packet with header $h_1$ along edge $m(h_1)$. In the second execution, the
adversary delivers these two packets to $v$ in the opposite order. This
continues for the remaining $b - 1$ packets on edge $e'$.

Next, the adversary causes all of the edges in $I - \{e'\}$ to fail and the
edge $e'$ is given 0 delay. The adversary gives messages to $S$ until $R$
reports $k + 1$ messages. Since the protocol is data oblivious, these two
executions are indistinguishable to $v$ and, hence, to $R$.

To be correct, the last message $R$ reports must be taken from one of the
packets caused by message $k + 1$. By construction, all such packets are caused
by packets that travelled along edge $e'$. In fact, by the choice of $k$, all
packets received by $v$ along edges other than $e'$ in either of the two
executions are caused by one of the first $k$ messages. Thus, for each packet
caused by message $k + 1$ that travelled along edge $e'$ in one of the two
executions, the corresponding packet in the other execution is not caused by
message $k + 1$. This implies that, in at least one of the two executions, the
last message $R$ reports is taken from a packet not caused by message $k + 1$
and, hence, is incorrect. □

Using similar techniques, we can prove a lower bound on the number of 
packet transmissions caused by one message in any memoryless, data oblivious, 
end-to-end protocol with single bit headers that is close to the number caused 
by one message in the basic protocol, given in Section 3.

Abstract. Distributed object caching is essential for building and deploying
Internet wide services based on middlewares such as CORBA. By caching objects,
it is possible to mask much of the latency associated with accessing remote
objects, to provide more predictable quality of service to clients, and to
improve the scalability of the service. This paper presents a combined
theoretical and practical view on specifying and implementing consistency
conditions for such a service. First, a formal definition of a set of basic
consistency conditions is given in an abstract, implementation independent
manner. It is then shown that common consistency conditions such as sequential
consistency, causal consistency, and PRAM can be formally specified as a
combination of these more basic conditions. Finally, the paper describes the
implementation of the proposed basic consistency conditions in CASCADE, a
distributed CORBA object caching service.



1 Introduction 

Object caching is a promising approach for improving the scalability, perfor- 
mance, and predictability of Internet oriented services that are based on object- 
oriented middlewares such as CORBA [17]. Accessing a local or nearby cache 
incurs a much lower latency than accessing a far away object, and the access time 
and availability of a cached copy is much more predictable than when accessing 
a remote object. Also, object caching greatly enhances the scalability of services 
because most client requests can be satisfied from a local cache and the service 
provider is relieved from the burden of servicing a large number of concurrent 
clients. 

An inevitable side-effect of caching is the need to keep copies of the same
cached object consistent, at least to some degree. In this paper, we explore,
both theoretically and practically, a flexible approach to consistency of such
services. That is, we start by formally defining a basic set of consistency
conditions, in an abstract, implementation independent way, and show how other
consistency conditions can be implemented as a combination of these basic
conditions. We then describe how we have implemented these conditions within
CASCADE [7], our CORBA object caching service. The paper also describes some
interesting optimizations we employed in this implementation.

Our specifications and resulting implementation have the following distinct
features:

Rigor: We provide a formal definition of basic consistency conditions, given
from the application point of view, as requirements on possible ordering of
clients' local histories. As discussed in [4], this implementation independent
approach yields more rigorous definitions, and it is easier to prove program
correctness with such definitions than with operational definitions.
Modularity: Our conditions can be combined in various ways to yield guarantees
with different levels of strength and complexity. This approach allows
the known tradeoff between the strength of the consistency semantics and
the overhead it imposes (cf. [4]) to be taken into consideration when
configuring the set of consistency guarantees for a particular application.
For users of our service, this means that they have more freedom in choosing
the exact consistency semantics they need. From the implementation standpoint,
this yields a more modular implementation. Since the implementation can be
divided into basic conditions, each of which is easier to implement than, say,
sequential consistency [12], the entire implementation is simpler, and
therefore more robust. Similarly, the implementation correctness proof is
easier, since we can prove the correctness of the implementation of each basic
condition separately; the formal proof about the combination of these
conditions then immediately implies that our service correctly implements the
corresponding more elaborate consistency conditions, e.g., sequential
consistency.
Comprehensiveness and usefulness for applications: The presented
specifications cover a wide range of consistency requirements for distributed
applications. This is shown by proving that many existing consistency
conditions such as sequential consistency [12], PRAM [13], and causal
consistency [2] can be specified as certain combinations of our basic
conditions. We also discuss the usefulness of other combinations and analyze
the interdependencies within the set of guarantees. In the full version of
this paper we present examples of several applications, each of which requires
some of our guarantees or a combination of them. Moreover, we show there that
all of our conditions are indeed useful, i.e., that there are applications
that require each of them.

^ In [7], we described CASCADE, the motivation behind it, its general implementation 
and a performance analysis. The current paper is the first place where we formally 
specify the basic consistency conditions, and elaborate on their exact implementation 
within CASCADE.

Although our implementation of consistency conditions is based on the widely
known notions of version numbers and version vectors, it is nevertheless
unique in exploiting the peculiarities of a hierarchical cache architecture
such as the one used in CASCADE. We envision that similar techniques can be
applied to other systems that employ hierarchical caches.

Finally, our implementation preserves consistency guarantees even when 
clients access cached copies at different servers during the execution. Further- 
more, we have designed novel optimizations that reduce the amount of infor- 
mation transferred between roaming clients and static servers. We believe that 
this latter contribution can be applied to other systems where it is required 
to maintain consistency for mobile clients. As Internet mobile clients become 
more common, we expect that our techniques will be useful for a wider range of 
applications. 



1.1 Related Work 

Many consistency conditions have been defined and investigated, mostly in the 
context of distributed shared memory, e.g., [2,4,9, 12, 13, 16, 18] and databases, 
e.g., [14,21]. A vast amount of research has been dedicated to implementing shared
memory systems with various consistency guarantees, including sequential con- 
sistency (sometimes referred to as strong consistency) [12], weak consistency [10], 
release consistency [6], causal consistency [2], lazy release consistency [11], en- 
try consistency [5], and hybrid consistency [8]. In contrast to our service, such 
systems are geared towards high-performance computing, and generally assume 
non-faulty environments and fast local communication. 

Much less attention, however, has been devoted to exploring consistency
guarantees suitable for object-oriented middlewares, especially for
middlewares in which a client is not bound to a particular server and can
switch servers at any time.

Our work is motivated by the Bayou project [21], which introduced a set of
basic consistency conditions for sessions of mobile clients and discussed
version vectors as a possible way of implementing them. That work also gave
numerous examples illustrating that these conditions are indeed useful for
applications. However, these definitions are introduced in [21] as constraints
on an implementation and are defined in the framework of a particular database
model.

The Globe system [22] follows an approach similar to CASCADE by pro- 
viding a flexible framework for associating various replication coherence models 
with distributed objects. Among the coherence models supported by Globe are 
the PRAM coherence, the causal coherence, the eventual coherence, etc. 

2 Definitions and Conventions 



We generally adopt the model and definitions as provided in [3] and [1], but
slightly adjust them to our needs. We assume a world consisting of clients and
servers. Clients invoke methods on objects as specified in a program. These
methods are then transformed into messages sent to one or more servers. The
servers can exchange messages among themselves and eventually send a reply to
the client. We assume that message delivery is (eventually) reliable and FIFO,
and that processes do not fail. However, the network might delay messages for
an arbitrarily long time and neither clients nor servers have access to
real-time clocks. The rationale behind this failure model is discussed in the
full version of this paper.

We assume that each method operates on one object, but each object might
have several read/write variables. Our definitions below are given from the
client point of view and thus, for the rest of this section, we will no longer
discuss servers; servers will be important in discussing the implementation
(Section 4). Also, our definitions and discussions assume one object. Note
that since each object has multiple variables, each object can be thought of
as a single distributed shared memory.

We assume that methods can be classified as either queries or updates,
depending on whether they simply return the value of variables they access, or
change them. To make the definitions comparable to the ones used in
distributed shared memory research, we will refer to updates as WRITES and to
queries as READS. Each method can either read or write several variables
atomically. In particular, a single READ operation might return values written
by several WRITE operations.

A local execution of a client process $p_i$, denoted $\sigma_i$, is a sequence
of READ and WRITE operations, denoted $o_1, o_2, \ldots$, that are performed
by $p_i$. We assume that a client's operations are always ordered in its local
history in the order specified in the program. For the sake of simplicity, we
will omit the variables accessed by an operation whenever possible. In what
follows, we sometimes refer to a local execution as a session, and use these
terms interchangeably. A global execution, or just execution $\sigma$, is a
collection of local executions for a given system run, one for each client of
the system.

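Given a sequence $S$ of operations, we denote $o_1 \xrightarrow{S} o_2$ when
$o_1$ precedes $o_2$ in the sequence. An execution $\sigma$ induces a partial
order, $\xrightarrow{\sigma}$, on the operations that appear in $\sigma$:
$o_1 \xrightarrow{\sigma} o_2$ if $o_1 \xrightarrow{\sigma_i} o_2$ for some
$p_i$.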

For a given execution $\sigma$ and a process $p_i$, denote by $\sigma|i$ the
restriction of $\sigma$ to events of $p_i$; denote by $\sigma|i+w$ the partial
execution consisting of all the operations of $p_i$ and all the WRITE
operations of other processes. Similarly, for a given sequence $S$ of
operations, denote by $S|i$ the restriction of $S$ to operations invoked by
$p_i$ and denote by $S|w$ the restriction of $S$ to WRITE operations.
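These restrictions are simple projections. The sketch below is only an illustration: the `Op` record and the function names are hypothetical, an execution is modelled as a map from process index to local execution, and the partial execution of $p_i$ together with all WRITEs is represented just by its set of operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    proc: int   # index i of the invoking client p_i
    kind: str   # "READ" or "WRITE"
    name: str   # label used only to tell operations apart


def restrict_to_process(seq: list[Op], i: int) -> list[Op]:
    """S|i: the operations of S invoked by p_i, in their original order."""
    return [op for op in seq if op.proc == i]


def restrict_to_writes(seq: list[Op]) -> list[Op]:
    """S|w: the WRITE operations of S, in their original order."""
    return [op for op in seq if op.kind == "WRITE"]


def ops_of_i_plus_writes(execution: dict[int, list[Op]], i: int) -> set[Op]:
    """Operations of sigma|i+w: everything p_i did, plus others' WRITEs."""
    result = set(execution[i])
    for j, local in execution.items():
        if j != i:
            result |= set(restrict_to_writes(local))
    return result


w1, r1 = Op(1, "WRITE", "w1"), Op(1, "READ", "r1")
w2, r2 = Op(2, "WRITE", "w2"), Op(2, "READ", "r2")
example_execution = {1: [w1, r1], 2: [w2, r2]}
```

For instance, the view of process 1 contains `w1`, `r1` and `w2`, but not the READ `r2` of process 2.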

We use the standard notions of serializations, legal serializations and
consistency conditions as defined, e.g., in [4].



^ This is sufficient for CASCADE in which consistency conditions are indeed provided 
per each object since each object has a separate hierarchy. However, in the future it 
would be interesting to extend the definitions to multiple objects.

3 Consistency Conditions

3.1 Basic Consistency Conditions 

Eventual Propagation: For every process $p_i$ there exists a legal
serialization $S_i$ of $\sigma|i+w$.

This requirement essentially expresses liveness of update propagation: For a 
given execution and a given update in this execution, if some process invokes 
an infinite number of queries, it will eventually see the result of this update. An 
implementation in which updates are not propagated does not guarantee any 
level of consistency. Henceforward, we assume that this condition always holds. 

Let us define a serialization set of $\sigma$ as a set of legal serializations
of $\sigma|i+w$, one for each $p_i$. Due to Eventual Propagation, at least one
serialization set exists for a given execution.

We now present five session guarantees. Each guarantee is defined as a pred- 
icate that takes a serialization or a serialization set and verifies whether this set 
satisfies the condition w.r.t. a session. 

Read Your Writes: For a given execution $\sigma$ and a process $p_i$, a valid
serialization $S_i$ of $\sigma|i+w$ preserves Read Your Writes for the session
$\sigma_i$ if for every two operations $o_1$ and $o_2$ in $\sigma_i$ such that
$o_1$ = WRITE, $o_2$ = READ, and $o_1 \xrightarrow{\sigma_i} o_2$, it holds
that $o_1 \xrightarrow{S_i} o_2$.

FIFO of Reads: For a given execution $\sigma$ and a process $p_i$, a valid
serialization $S_i$ of $\sigma|i+w$ preserves FIFO of Reads for the session
$\sigma_i$ if for every two operations $o_1$ and $o_2$ in $\sigma_i$ such that
$o_1$ = READ, $o_2$ = READ, and $o_1 \xrightarrow{\sigma_i} o_2$, it holds
that $o_1 \xrightarrow{S_i} o_2$.

FIFO of Writes: For a given execution $\sigma$ and a process $p_i$, a
serialization set $S = \{S_j\}$ preserves FIFO of Writes for the session
$\sigma_i$ if for every two operations $o_1$ and $o_2$ in $\sigma_i$ such that
$o_1$ = WRITE, $o_2$ = WRITE, and $o_1 \xrightarrow{\sigma_i} o_2$, it holds
that $o_1 \xrightarrow{S_j} o_2$ for every $S_j \in S$.

Reads Before Writes: For a given execution $\sigma$ and a process $p_i$, a
valid serialization $S_i$ of $\sigma|i+w$ preserves Reads Before Writes for the
session $\sigma_i$ if for every two operations $o_1$ and $o_2$ in $\sigma_i$
such that $o_1$ = READ, $o_2$ = WRITE, and $o_1 \xrightarrow{\sigma_i} o_2$, it
holds that $o_1 \xrightarrow{S_i} o_2$.

Session Causality:^ For this definition we assume that no value is written more
than once to the same variable. For a given execution $\sigma$ and a process
$p_i$, a serialization set $S = \{S_j\}$ preserves Session Causality for the
session $\sigma_i$ if for every three operations $o_1$, $o_2$ and $o_3$ such
that $o_2$ and $o_3$ are in $\sigma_i$, $o_1$ = WRITE, $o_2$ = READ,
$o_3$ = WRITE, $o_2$ read a result written by $o_1$ and
$o_2 \xrightarrow{\sigma_i} o_3$, it holds that $o_1 \xrightarrow{S_j} o_3$ for
every $S_j \in S$.
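The order-preservation part of the first four guarantees can be checked mechanically. The sketch below is an illustration under simplifying assumptions: operations are modelled as hypothetical `(process, kind, name)` tuples, legality of the serializations is not checked, and Session Causality is omitted because it also needs the reads-from relation.

```python
def precedes(seq: list[tuple], a: tuple, b: tuple) -> bool:
    """True if operation a appears before operation b in seq."""
    return seq.index(a) < seq.index(b)


def preserves_pairs(local: list[tuple], serialization: list[tuple],
                    first_kind: str, second_kind: str) -> bool:
    """Every pair (o1, o2) of the session with o1 of first_kind ordered
    before o2 of second_kind keeps that order in the serialization."""
    for x in range(len(local)):
        for y in range(x + 1, len(local)):
            o1, o2 = local[x], local[y]
            if o1[1] == first_kind and o2[1] == second_kind:
                if not precedes(serialization, o1, o2):
                    return False
    return True


def read_your_writes(local, s_i):
    return preserves_pairs(local, s_i, "WRITE", "READ")


def fifo_of_reads(local, s_i):
    return preserves_pairs(local, s_i, "READ", "READ")


def reads_before_writes(local, s_i):
    return preserves_pairs(local, s_i, "READ", "WRITE")


def fifo_of_writes(local, serialization_set):
    # Must hold in the serialization of every process, not only p_i's.
    return all(preserves_pairs(local, s, "WRITE", "WRITE")
               for s in serialization_set)


wa, ra, wb = (1, "WRITE", "wa"), (1, "READ", "ra"), (1, "WRITE", "wb")
wx = (2, "WRITE", "wx")
session_1 = [wa, ra, wb]
```

With this session, the serialization `[wx, wa, ra, wb]` passes all four checks, while `[ra, wx, wa, wb]` violates Read Your Writes because the read is placed before the session's own earlier write.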



^ Called Writes Follow Reads in [21].

As noticed in [21], while Read Your Writes, FIFO of Reads and Reads Before
Writes only affect the sessions for which they are provided, Session Causality
and FIFO of Writes contain guarantees w.r.t. the executions of other processes.
Accordingly, we define the former conditions for a single serialization and the
latter conditions for a serialization set. However, the following definitions
require the same form for all the conditions. Therefore, we assume below that
Read Your Writes, FIFO of Reads and Reads Before Writes are defined for a
serialization set in which only a single serialization is used in the
definition (the definitions in this latter form can be found in the full
version of this paper).

For any condition $X$ of these five session properties and a given execution
$\sigma$, we say that a serialization set $S = \{S_j\}$ globally preserves $X$
if it preserves $X$ for all the sessions $\sigma_i \in \sigma$.

We now introdnce a definition of the Total Order condition: 

Total Order: For a given execution $\sigma$, a serialization set
$S = \{S_j\}$ globally preserves Total Order if for every two serializations
$S_i$ and $S_j$ in $S$, $S_i|w = S_j|w$.
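Total Order amounts to comparing the WRITE projections of all serializations in the set. A minimal sketch, again assuming hypothetical `(process, kind, name)` tuples for operations:

```python
def writes_only(serialization: list[tuple]) -> list[tuple]:
    """The restriction of a serialization to its WRITE operations."""
    return [op for op in serialization if op[1] == "WRITE"]


def globally_preserves_total_order(serialization_set: list[list[tuple]]) -> bool:
    """True if all serializations order the WRITE operations identically."""
    projections = [writes_only(s) for s in serialization_set]
    return all(p == projections[0] for p in projections)


w_a, w_b = (1, "WRITE", "a"), (2, "WRITE", "b")
r_1, r_2 = (1, "READ", "r1"), (2, "READ", "r2")
agreeing = [[w_a, r_1, w_b], [w_a, w_b, r_2]]
disagreeing = [[w_a, r_1, w_b], [w_b, w_a, r_2]]
```

READ operations may be interleaved differently in each serialization; only the relative order of the WRITEs has to coincide.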

For a given execution $\sigma$, a serialization set $S = \{S_j\}$ globally
preserves some set of the conditions defined above if $S$ globally preserves
each condition in this set. Finally, we say that an execution $\sigma$ is
consistent with respect to a condition set (or a single condition) $Y$ if
there exists a serialization set $S$ of $\sigma$ such that $S$ globally
preserves $Y$. We say that an implementation $A$ obeys a condition set (or a
single consistency condition) $Y$ if every execution generated by $A$ is
consistent with respect to $Y$.

3.2 Examples of Known Consistency Conditions 

The following is a list of several important and well known consistency condi- 
tions: 

Sequential Consistency (SC) [12]: An execution $\sigma$ is sequentially
consistent if there exists a legal serialization $S$ of $\sigma$ such that for
each process $p_i$, $\sigma|i = S|i$.

PRAM Consistency [13]: An execution $\sigma$ is PRAM consistent if for every
process $p_i$ there exists a legal serialization $S_i$ of $\sigma|i+w$ such
that if $o_1$ and $o_2$ are two operations in $\sigma|i+w$ and
$o_1 \xrightarrow{\sigma} o_2$, then $o_1 \xrightarrow{S_i} o_2$.

Note that instead of requiring a legal serialization $S_i$ for every process
$p_i$, this definition can be rephrased to require the existence of a
serialization set. We will use this latter form in order to define the
conjunction of PRAM consistency with other consistency conditions, e.g., in
the theorems below. This latter form also appears in the full version of this
paper.

Causal Consistency [2]: For the definition of causal consistency we assume
that no value is written more than once to the same variable. Given an
execution $\sigma$, an operation $o_1$ directly precedes $o_2$ (denoted
$o_1 \rightsquigarrow o_2$) if either $o_1 \xrightarrow{\sigma} o_2$ or
$o_1$ = WRITE, $o_2$ = READ, and $o_2$ read a result written by $o_1$. Let
$\rightsquigarrow^{*}$ denote the transitive closure of $\rightsquigarrow$.
An execution $\sigma$ is causally consistent if for every process $p_i$ there
exists a legal serialization $S_i$ of $\sigma|i+w$ such that $S_i$ respects
$\rightsquigarrow^{*}$, i.e., if $o_1$ and $o_2$ are two operations in
$\sigma|i+w$ and $o_1 \rightsquigarrow^{*} o_2$, then
$o_1 \xrightarrow{S_i} o_2$.

3.3 Discussion 

Most consistency implementations preserve Reads Before Writes, mainly because
in most implementations a READ operation is blocking and execution is resumed
only after a result is returned. We include this condition here, however, for
completeness and because it plays an important role in dependencies between
consistency conditions. In the future, we intend to investigate the
implications for systems in which this condition does not hold.

Any single condition that relates two events of the same type is trivial by 
itself. For example, if we only require FIFO of Reads, then naturally we can 
always find legal serializations in which all reads are ordered in FIFO order. 
This is because we have not placed any requirements on writes, and thus we 
have the freedom to order the writes in the serialization so all the reads are 
legal. This applies similarly also to FIFO of Writes, Session Causality and Total 
Order. Thus, these guarantees become meaningful only in combinations that 
contain several guarantees of different types. The only guarantee that is not 
trivial by itself is Read Your Writes. 

We now present several theorems that show how some combinations of the 
basic consistency conditions relate to each other and to other known consistency 
conditions. The proofs of these theorems can be found in the full version of this 
paper. 

Theorem 1. Any execution that is consistent w.r.t. Total Order and Reads
Before Writes is also consistent w.r.t. Session Causality.

Conclusion: Since Reads Before Writes holds in almost all implementations, the 
practical meaning of this theorem is that Total Order implies Session Causality. 

Theorem 2. Any execution that is consistent w.r.t. FIFO of Writes, FIFO of
Reads, Read Your Writes and Reads Before Writes is also PRAM consistent.
Vice versa, any PRAM consistent execution is also consistent w.r.t. FIFO of
Writes, FIFO of Reads, Read Your Writes and Reads Before Writes.

Theorem 3. Any execution that is PRAM consistent and is consistent w.r.t.
Session Causality is also causally consistent. Vice versa, any causally
consistent execution is also consistent w.r.t. Session Causality and PRAM.

Theorem 4. Any execution that is PRAM consistent and is consistent w.r.t.
Total Order is also sequentially consistent. Vice versa, any sequentially
consistent execution is also consistent w.r.t. Total Order and PRAM.

^ Note that being consistent w.r.t. a set of properties is a stronger property than just 
being consistent w.r.t. each property in the set. 

^ This claim can also be derived from the results of [18] whose focus, however, is 
different from ours.

4 Implementation of Consistency Conditions in CASCADE

The general implementation of CASCADE has been presented in [7], but without
specific details about the support for the basic consistency conditions that
were presented in Section 3. We start this section by covering general elements
of the CASCADE architecture that are needed to provide the right context for
describing the consistency implementation. Then, we explain in detail how each
individual consistency condition is implemented in CASCADE.

However, due to lack of space, we do not present here any pseudocode or
proofs that our implementation obeys a given combination of basic consistency
conditions; these appear in the full version of the paper. Furthermore, we do
not discuss Reads Before Writes in this section: As explained in Section 3.3,
Reads Before Writes trivially holds in any natural implementation.

4.1 Hierarchical Caching in CASCADE 

A detailed description of the CASCADE architecture can be found in [7]. Here
we only briefly summarize the design choices that are important for the
consistency implementation: The service is provided by a number of servers,
each of which is responsible for a specific logical domain. In practice, these
domains can correspond to geographical areas. We call these servers Domain
Caching Servers (DCSs).

Cached copies of each object are organized into a hierarchy. A separate
hierarchy is constructed for each object. The construction mechanism ensures
that for each client, the client's local DCS (i.e., the DCS responsible for the
client's domain) obtains a copy of the object. In addition, this mechanism
attempts to guarantee that the object copy is obtained from the nearest DCS
having a copy of this object. Once the local DCS has an object copy, client
requests for object method invocation normally go to this DCS, so that the
client does not have to communicate with a far server. Only if the local DCS
becomes overloaded or unavailable can the client decide to switch to another
DCS. While we account for such a possibility in CASCADE, we consider it an
unlikely event. Therefore, our implementation is optimized for the case when
the client communicates with a small number of DCSs during its execution (see
Section 4.4).

4.2 Implementation of Eventual Propagation and Total Order 

CASCADE always guarantees Eventual Update Propagation, while the use of
other conditions can be controlled by the application. To guarantee Eventual
Update Propagation, queries are always locally executed at the DCS a client
communicates with and updates are propagated through the hierarchy. However,
the way updates propagate and the order in which they are applied depend
on whether Total Order is required.

If Total Order is not required by the application, Eventual Propagation is
implemented as follows: A DCS that receives an update request from a client
applies it locally and sends it to all its neighbors in the hierarchy in parallel. A
DCS that receives an update request from a neighbor DCS A applies the update
and performs flooding, i.e., sends the request to all its neighbors but A. Note
that this propagation protocol preserves per-DCS FIFO of updates because all
the links are FIFO (as specified in Section 4.1) and because there is only one
path in the hierarchy between any pair of nodes. Furthermore, per-DCS Session
Causality also holds: If a DCS receives and applies an update, and then some
client queries the object state and issues another update at this DCS, then the
second update will be broadcast to the neighbors of this DCS after the first one.
We will show later in this section how these facts can be exploited in order to
provide an efficient implementation of session guarantees.
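
The flooding rule described above can be illustrated with a minimal Python
sketch. It is not CASCADE code: the class DCS, its methods, and the helper
link are hypothetical names, and synchronous method calls stand in for FIFO
links between neighbors.

```python
class DCS:
    """Hypothetical DCS node in a tree-shaped hierarchy (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []   # adjacent DCSs in the hierarchy
        self.applied = []     # updates in the order they were applied here

    def client_update(self, update):
        # Update received from a client: apply locally, send to all neighbors.
        self.applied.append(update)
        for n in self.neighbors:
            n.neighbor_update(update, sender=self)

    def neighbor_update(self, update, sender):
        # Update received from a neighbor: apply, then flood to all but the sender.
        self.applied.append(update)
        for n in self.neighbors:
            if n is not sender:
                n.neighbor_update(update, sender=self)


def link(a, b):
    """Connect two DCSs with a bidirectional link (assumed FIFO)."""
    a.neighbors.append(b)
    b.neighbors.append(a)


root, left, right, leaf = DCS("root"), DCS("left"), DCS("right"), DCS("leaf")
link(root, left)
link(root, right)
link(left, leaf)

leaf.client_update("u1")
leaf.client_update("u2")
```

Because the hierarchy is a tree, each DCS receives every update exactly once,
and the two updates issued at the same DCS are applied everywhere in their
issue order.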

The Totally Ordered Eventual Propagation (i.e., the Total Order + Eventual
Propagation conditions) is implemented as follows: Updates first ascend through
the hierarchy towards the root. The root of the hierarchy orders the updates in
a sequence, applies them and propagates ordered updates through the hierarchy
downwards towards the leaves.
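
As a rough illustration of this ascend-then-descend scheme, the following
Python sketch lets the root act as a sequencer. The class and method names are
assumptions made for the example, and the integer sequence number is only one
possible way for the root to order updates.

```python
class DCS:
    """Hypothetical DCS node; parent is None for the root of the hierarchy."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.applied = []       # (sequence number, update) in application order
        self.next_seq = 1       # used only at the root
        if parent is not None:
            parent.children.append(self)

    def client_update(self, update):
        self.ascend(update)

    def ascend(self, update):
        # Forward the update towards the root without applying it.
        if self.parent is not None:
            self.parent.ascend(update)
        else:
            seq = self.next_seq          # the root fixes the total order
            self.next_seq += 1
            self.descend(seq, update)

    def descend(self, seq, update):
        # Apply the ordered update, then pass it down towards the leaves.
        self.applied.append((seq, update))
        for child in self.children:
            child.descend(seq, update)


root = DCS("root")
a = DCS("a", parent=root)
b = DCS("b", parent=root)

a.client_update("x")
b.client_update("y")
```

Every DCS applies updates only on the way down, so all replicas see the same
sequence regardless of where an update was issued.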

Note that this implementation of Total Order is not affected by the presence or
absence of application demand for other consistency conditions. Moreover, this
implementation is entirely based on the DCS algorithm and inter-DCS protocol,
and does not require any client involvement.

Since our goal is to address Internet applications, where extremely long delays
are common, we have made the design choice that update requests can return
before the update has traversed the entire object hierarchy. The result, however,
is that the implementation of session conditions requires client cooperation in
most cases.

Also, the implementation of the session conditions adapts itself to the set
of consistency requirements chosen by the application. In particular, their
implementation is significantly affected by the presence or absence of Total Order.
Therefore, we discuss their implementation with and without Total Order
separately.

4.3 Implementing Session Guarantees in Presence of Total Order 

The implementation of the session guarantees is greatly simplified by the
presence of the Total Order implementation. First, Session Causality is achieved
for free, as Theorem 1 implies. Second, the root of the hierarchy can assign each
update a global update identifier that serves as a version number of the object.
Hence, an object version can be identified by a single number. As a result, the
implementation of session guarantees becomes simpler, less information needs to
be stored at both clients and DCSs, and most important, less consistency-related
data needs to be transferred between a client and a DCS per method invocation.

Specifically, with each query result, a DCS returns to the client the number
of the object version this query sees. It would be more complicated to handle
updates in a similar way because updates have to be propagated first to the root
DCS, which assigns them an update identifier. In principle, a DCS that received
an update request from a client can block the client until the update identifier
is received from the root DCS and then pass this version number to the client.
This way, the only consistency information to be transferred between a client and
a DCS would be a single global version number. However, since CASCADE is
intended to operate in a WAN environment and the propagation latency between
a client DCS and a root DCS may be quite significant, this solution may block
the client for a prohibitively long time.

Therefore, CASCADE adopts an alternative identifier scheme for updates:
Each DCS maintains a counter of updates originated at this DCS and each
update is assigned a local update identifier consisting of the DCS identifier and
a counter value. In contrast to the global version numbers, two local update
identifiers assigned by different DCSs are incomparable. When a client invokes
an update request on a DCS, the DCS immediately produces a new local update
identifier and returns it to the client.
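
A local update identifier of this kind could be modeled as in the Python
sketch below. The names LocalUpdateId, DCS.new_update_id and precedes are
hypothetical; the point is only that identifiers from the same DCS are ordered
by their counters, while identifiers from different DCSs have no defined order.

```python
from typing import NamedTuple, Optional


class LocalUpdateId(NamedTuple):
    dcs_id: str     # identifier of the DCS that accepted the update
    counter: int    # value of that DCS's update counter


class DCS:
    """Hypothetical DCS that hands out local update identifiers."""

    def __init__(self, dcs_id: str):
        self.dcs_id = dcs_id
        self.counter = 0

    def new_update_id(self) -> LocalUpdateId:
        # Produced immediately, without contacting the root.
        self.counter += 1
        return LocalUpdateId(self.dcs_id, self.counter)


def precedes(a: LocalUpdateId, b: LocalUpdateId) -> Optional[bool]:
    """Order two identifiers; None means they are incomparable.

    An explicit function is used because plain tuple comparison would
    wrongly order identifiers issued by different DCSs.
    """
    if a.dcs_id != b.dcs_id:
        return None
    return a.counter < b.counter


dcs_a, dcs_b = DCS("A"), DCS("B")
id1 = dcs_a.new_update_id()
id2 = dcs_a.new_update_id()
id3 = dcs_b.new_update_id()
```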

For the implementation of some session guarantees we need to maintain a version
vector for an object with one entry per DCS in the hierarchy; each entry in this
vector corresponds to the last local update identifier received from the
corresponding DCS. Version vectors are maintained in the following way: When an
update ascends through the hierarchy towards the root, the local update
identifier is piggybacked on the update message. When this update is propagated
from the root towards the leaves, its global version number and local update
identifier are both piggybacked. Upon receiving and applying this update, a DCS
updates its current object version number and version vector.
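
The bookkeeping performed by a DCS when an ordered update arrives from the
root might look as follows. This is a sketch under assumptions: the class
DCSState and its methods are invented for illustration, and a version vector
is represented as a dictionary from DCS identifier to the counter of the last
local update identifier applied.

```python
class DCSState:
    """Hypothetical per-object consistency state kept by one DCS."""

    def __init__(self):
        self.version = 0           # global version number of the object
        self.version_vector = {}   # DCS id -> counter of last applied local id

    def apply_ordered_update(self, global_version, local_id):
        # Both identifiers are piggybacked on the downward update message.
        dcs_id, counter = local_id
        self.version = global_version
        self.version_vector[dcs_id] = counter

    def has_applied(self, local_id):
        # True if the update named by local_id has already been applied here.
        dcs_id, counter = local_id
        return self.version_vector.get(dcs_id, 0) >= counter


s = DCSState()
s.apply_ordered_update(1, ("A", 1))
s.apply_ordered_update(2, ("B", 1))
s.apply_ordered_update(3, ("A", 2))
```

Keeping only the last counter per DCS relies on updates from the same DCS
being applied in the order of their local identifiers.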

We now describe the individual implementation of the three session
guarantees that require a non-trivial implementation:

FIFO of Reads: As previously explained, with each query result, a DCS
returns to the client the object's version number that this query sees. The
client passes this number to a (possibly different) DCS upon its next query.
This DCS does not apply the query and blocks the client until it receives
and applies the update referred to by the version number (in other words,
the DCS synchronizes the query with the version number).

FIFO of Writes: For implementing FIFO of Writes, the root DCS should
maintain a version vector which contains the last local update identifier received
from each DCS. Keeping only the last update identifier is sufficient because
Total Order preserves per-DCS FIFO of updates: Two updates issued at the
same DCS reach the root, where they are ordered in the order of their issuance.
As previously explained, when a client invokes an update request on a DCS,
the DCS transfers a local update identifier back to the client. The client only
remembers the last local identifier it received from some DCS and forgets
all previous local identifiers. This is sufficient because FIFO of Writes is a
transitive relation and it is enough to remember only the last predecessor.
The client passes the last known local identifier to a (possibly different) DCS
upon the invocation of its next update request. The DCS piggybacks this
identifier on the update message that traverses the hierarchy towards the
root. The root DCS compares this identifier against the version vector and
blocks the message until the update referred to by the identifier is received
and applied.

Read Your Writes: For implementing Read Your Writes, each DCS maintains 
a version vector. When a client invokes a query request on a DCS, it passes 
the local update identifier(s) of the last update(s) it initiated. The DCS 
synchronizes the query with these identifiers based on the information stored 
in its version vector. 

If Read Your Writes is provided along with FIFO of Writes, one last local 
update identifier is sufficient to be synchronized with because FIFO is a 
transitive relation. Otherwise, for each DCS the client sent an update re- 
quest to, it should remember the last local update identifier received from 
this DCS. In this case, the query must be synchronized with the entire set 
of identifiers. However, since we assume in the model that a client only com- 
municates with a small subset of all existing DCSs in the object hierarchy, 
the set of identifiers is also small and its transfer between a client and a DCS 
is not an expensive operation. 

If Read Your Writes is provided along with FIFO of Reads, the amount of 
information to transfer and store at a client can be optimized in a different 
way: The client should only remember the local update identifiers it received 
since the last query. If the client first issues several updates and then two 
queries, the first query will be synchronized with the updates and the sec- 
ond query will be synchronized with the first one. Therefore, no explicit 
synchronization of the second query with the updates is necessary in this 
case. 
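
To make the synchronization checks for FIFO of Reads and Read Your Writes
concrete, the following Python sketch shows a DCS deciding whether a query may
be served. All names are hypothetical, and returning None stands in for
blocking the client until the missing updates arrive.

```python
class DCS:
    """Hypothetical replica; not CASCADE's actual interface."""

    def __init__(self):
        self.state = None
        self.version = 0           # global version number assigned by the root
        self.version_vector = {}   # DCS id -> last local update counter applied

    def apply(self, global_version, local_id, new_state):
        dcs_id, counter = local_id
        self.state = new_state
        self.version = global_version
        self.version_vector[dcs_id] = counter

    def query(self, min_version=0, written_ids=()):
        """Return (state, version), or None where a real DCS would block."""
        if self.version < min_version:               # FIFO of Reads
            return None
        for dcs_id, counter in written_ids:          # Read Your Writes
            if self.version_vector.get(dcs_id, 0) < counter:
                return None
        return self.state, self.version


fresh, stale = DCS(), DCS()
fresh.apply(1, ("A", 1), "s1")
fresh.apply(2, ("B", 1), "s2")
stale.apply(1, ("A", 1), "s1")

_, seen = fresh.query()                    # client reads version 2 at one DCS
blocked = stale.query(min_version=seen)    # a stale DCS must make the client wait
stale.apply(2, ("B", 1), "s2")
served = stale.query(min_version=seen, written_ids=[("B", 1)])
```

The client carries a single version number for FIFO of Reads and a short list
of local update identifiers for Read Your Writes, which matches the small
per-invocation overhead described above.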

In summary, if Total Order is provided, the implementation of the session 
guarantees introduces an insignificant extra overhead: The amount of consistency 
information that needs to be stored at clients and transferred between clients 
and DCSs is small and does not depend on the number of clients and DCSs in 
the system. 



4.4 Implementing Session Guarantees Without Total Order 

When the Total Order implementation is not employed, an object does not 
have a single version number. In this case its state can only be characterized 
by the version vector that has to be maintained by each DCS. While this does 
not affect the implementation of Read Your Writes and the implementation of 
FIFO of Writes remains almost as simple as in the case of Total Order, the 
implementation of FIFO of Reads becomes more complicated and expensive. In 
addition, an implementation of Session Causality should now be provided. We 
elaborate on the changes in the implementations below: 

FIFO of Writes: As with Total Order, a DCS returns a local update identifier 
to the client that initiates the update, a client remembers only the last local 
identifier and forgets the previous one, and this local identifier is transferred 
to a DCS upon the next update request. The only change is that now the 
DCS blocks the client and does not assign the update request a local identifier
until it receives the referred update. If the DCS immediately produced a local
update identifier, released the client and left the update request in a pending
state, then all later (unrelated) update requests with higher local update
identifiers would have to wait until this update was applied. This is
a shortcoming of the version vector method, which assumes that updates
originated at the same DCS are applied in the order of their local identifiers.
An appealing alternative to blocking the client is to use a version vector of
sliding windows instead of just a vector of update identifiers. In this solution
updates can sometimes be applied in an order different from that of their
local identifiers. However, a DCS has to remember the identifiers of the
updates applied out of order. Therefore, while eliminating unnecessary delays,
this solution requires more space and more complicated version management.
Moreover, this solution makes the implementation of FIFO of Reads
complicated and inefficient.

FIFO of Reads: Without Total Order, the simplest implementation of this
condition is that a DCS transfers the entire version vector to a client along
with the results of a query. The client remembers the version vector it
received the last time and forgets the previous vector. This vector is passed
to a (possibly different) DCS upon the next client query, and the DCS
synchronizes the query with each local identifier in the vector.

This implementation is inefficient because the entire version vector, whose
length is the number of DCSs in the object hierarchy, is sent twice per
query. Below we introduce optimizations that allow us to reduce the average
amount of transferred information.

Session Causality: Again, the simplest implementation is that a DCS trans- 
fers the entire version vector to a client along with the query results. However, 
unless FIFO of Reads is also provided, it is not sufficient that a client re- 
members only the version vector it received in the previous interaction with 
the DCS. Actually, the client must merge all the vectors it received during 
the execution by computing their maximum. This merged vector is passed 
to a DCS upon the next client update. Furthermore, since every DCS has 
to synchronize this update with this vector, the DCS piggybacks the entire 
vector on the update message sent to other DCSs. In the future, we intend 
to investigate the possibility of using the causal separators technique [19] in 
order to reduce the amount of piggybacked information. This technique ap- 
pears especially appealing due to the hierarchical architecture employed by 
CASCADE in which each intermediate node can act as a causal separator. 
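
The two vector operations used by these straightforward implementations,
merging by component-wise maximum and checking that a DCS has caught up with a
given vector, can be sketched in Python as follows. The function names and the
dictionary representation of version vectors are assumptions of this example.

```python
def merge(v1, v2):
    """Component-wise maximum of two version vectors (DCS id -> counter)."""
    return {d: max(v1.get(d, 0), v2.get(d, 0)) for d in v1.keys() | v2.keys()}


def dominates(dcs_vector, required):
    """True if the DCS has applied every update named in `required`."""
    return all(dcs_vector.get(d, 0) >= c for d, c in required.items())


# A client providing Session Causality merges every vector it receives.
seen = {}
for reply_vector in ({"A": 2, "B": 1}, {"B": 3, "C": 1}):
    seen = merge(seen, reply_vector)

ready = dominates({"A": 2, "B": 3, "C": 4}, seen)
not_ready = dominates({"A": 2, "B": 2, "C": 4}, seen)
```

A DCS would delay a query or update until dominates returns True for the
vector supplied with the request.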

As we see, when Total Order is not employed, the straightforward implemen- 
tations of FIFO of Reads and Session Causality are quite expensive in terms of 
the amount of information to be transferred over the network. Fortunately, the 
implementation of FIFO of Reads can be significantly improved by using the 
optimization that is explained below. 
Efficient FIFO of Reads Implementation. First, rather than sending the
entire version vector to a client as part of the response to queries, a DCS can send
the difference between its current version vector and the vector received from
the client for the purpose of synchronization. This difference is usually shorter
than the entire version vector. For example, the difference of ((A, 1), (B, 3)) and
((A, 1), (B, 1)) is ((B, 3)). Upon receiving such a vector difference the client can
add it to the vector it sent and restore the entire version vector of the DCS in
its local memory. However, the client should still send its entire version vector
for synchronization.
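
A minimal Python sketch of this differential transfer is given below. The
functions diff and restore are hypothetical helpers, version vectors are
modeled as dictionaries, and the DCS names A and B are illustrative.

```python
def diff(current, reference):
    """Entries of `current` that are newer than the corresponding entry in `reference`."""
    return {d: c for d, c in current.items() if c > reference.get(d, 0)}


def restore(reference, difference):
    """Rebuild the DCS's vector from the vector the client sent plus the difference."""
    merged = dict(reference)
    merged.update(difference)
    return merged


client_vector = {"A": 1, "B": 1}
dcs_vector = {"A": 1, "B": 3}

delta = diff(dcs_vector, client_vector)      # sent by the DCS with the query result
rebuilt = restore(client_vector, delta)      # computed by the client
```

Only the entries that changed since the client's vector travel back over the
network, while the client can still reconstruct the full vector locally.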

Another optimization is based on the following observation: If a client does
not switch DCSs (in other words, it invokes all updates and queries on the same
DCS), then FIFO of Reads always holds in a trivial way and does not need to
be implemented at all. Furthermore, FIFO of Writes and Session Causality also
trivially hold due to per-DCS FIFO of Writes and per-DCS Session Causality,
respectively. This situation is summarized in Table 1, which clearly shows the cost
of client mobility.



Table 1. The implementation cost of session guarantees

✓  - trivially holds
X  - adds extra cost
XX - requires costly communication



Unfortunately, if the consistency implementation is unaware that the client
continues to work with the same DCS, it transfers the same large amount of
information as if the client switched DCSs. This observation calls for optimizing
the implementation for the most usual and frequent case when a client
communicates with a single DCS. The client can just verify that it invokes the
current request on the same DCS as the previous one. If this is true, the client
does not need to send any information for synchronization.
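
The client-side check could be as simple as the following Python sketch, in
which the class Client and its methods are hypothetical and a version vector is
a dictionary from DCS identifier to counter.

```python
class Client:
    """Hypothetical client-side bookkeeping for one object."""

    def __init__(self):
        self.last_dcs = None     # DCS that served the previous request
        self.last_vector = {}    # last version vector known to the client

    def sync_info_for(self, dcs_id):
        """Vector to send with the next query, or None if none is needed."""
        if dcs_id == self.last_dcs:
            return None          # same DCS as before: the guarantee holds trivially
        return self.last_vector

    def record_reply(self, dcs_id, vector):
        self.last_dcs = dcs_id
        self.last_vector = dict(vector)


c = Client()
first = c.sync_info_for("D1")        # nothing seen yet, so an empty vector
c.record_reply("D1", {"A": 2})
same = c.sync_info_for("D1")         # staying at D1: no synchronization data
moved = c.sync_info_for("D2")        # switching to D2: full vector is required
```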

However, a DCS still has to return its version vector along with the query 
results in order to account for the possibility that a client invokes the next query 
on another DCS. Furthermore, if a client sends no synchronization information, 
we can no longer use the differential optimization described above because a 
DCS has no reference point to compute the difference of vectors. 

Thus, there is a need for synchronization information shorter than just an
entire version vector. To this end, we introduce a notion of local DCS history,
which is a numbered sequence of update identifiers of all the updates applied at
the DCS during the execution. A local history pointer is just an index into the
local DCS history. A DCS can return this pointer to a client, and a client can
transfer it back to the DCS for synchronization at some later point. As a result,
only local history pointers and vector differences are transferred over the
network instead of entire version vectors.

(The vector-difference optimization described above is similar to the
Singhal-Kshemkalyani technique [20] for implementing vector clocks.)

As part of this optimization, a DCS should be able to compute the differ- 
ence between its current version vector and a pointer to some past point of its 
local history. An important question is how this can be done efficiently with- 
out keeping the whole local history. The full version of this paper provides a 
detailed explanation of the algorithm used in CASCADE that satisfies these re- 
quirements. It also describes a generalization of this optimization for the case 
when a client communicates with several DCSs. This generalization proves to be 
efficient when the number of DCSs is small (which is the usual case as noted in 
Section 4.1). 
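
For intuition only, the Python sketch below shows a naive form of local
history pointers that keeps the whole history in memory. It is explicitly not
the algorithm used in CASCADE, which avoids storing the full history; the class
and method names are invented for this example.

```python
class LocalHistory:
    """Naive local DCS history: stores every applied local update identifier."""

    def __init__(self):
        self.history = []    # (DCS id, counter) pairs in application order

    def apply(self, local_id):
        self.history.append(local_id)

    def pointer(self):
        # Index of the next history slot; returned to the client with a reply.
        return len(self.history)

    def diff_since(self, pointer):
        """Vector difference between the current state and a past pointer."""
        difference = {}
        for dcs_id, counter in self.history[pointer:]:
            difference[dcs_id] = max(counter, difference.get(dcs_id, 0))
        return difference


h = LocalHistory()
h.apply(("A", 1))
p = h.pointer()              # the client remembers this single integer
h.apply(("B", 1))
h.apply(("B", 2))
delta = h.diff_since(p)      # what changed at this DCS since the client's last visit
```

The client stores and sends one integer instead of a vector, at the price of
the DCS having to answer diff_since efficiently, which is the problem the
full version of the paper addresses.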

5 Future Work 

It would be interesting to arrive at a complete set of basic consistency
conditions. That is, to be able to show that any consistency condition can be
provided as a combination of a subset of these conditions, and that each of
these conditions is necessary for implementing at least one consistency
condition. In our opinion, this should be done from the application's point of
view, as in our definitions and the works of [2,4], since such definitions are
more rigorous, easier to understand, and can be used more easily by programmers
to prove correctness of their applications.

As for the implementation, it is possible to implement each of the basic con- 
sistency conditions separately, and then trigger the required ones based on the 
application’s need. We have decided not to follow this path, and to optimize the 
implementation of various conditions based on the other conditions being pro- 
vided, since an independent implementation of each condition was too wasteful 
and slow. Perhaps the right way to tackle this issue is by providing an
independent implementation for each condition, and then using a high-level compiler
to optimize combinations of conditions, similar to the work on automatically
optimizing and proving group communication protocol stacks in Ensemble [15].